DATA MINING PLATFORM FOR BIOINFORMATICS AND
OTHER KNOWLEDGE DISCOVERY

RELATED APPLICATIONS 

The present application claims the priority of each of the following U.S.
provisional patent applications: Serial No. 60/298,842, Serial No. 60/298,757,
and Serial No. 60/298,867, all filed June 15, 2001, and, for U.S. national stage
purposes, is a continuation-in-part of PCT application Serial No.
PCT/US02/16012, which was filed in the U.S. Receiving Office on May 20, 2002,
which is a continuation-in-part of U.S. Patent Application Serial No.
10/057,849, filed January 24, 2002, which is a continuation-in-part of
application Serial No. 09/633,410, filed August 7, 2000, which is a
continuation-in-part of application Serial No. 09/578,011, filed May 24, 2000,
which is a continuation-in-part of application Serial No. 09/568,301, filed
May 9, 2000, now issued as Patent No. ________, which is a continuation of
application Serial No. 09/303,387, filed May 1, 1999, now issued as Patent No.
6,128,608, which claims priority to U.S. provisional application Serial No.
60/083,961, filed May 1, 1998. This application is related to co-pending
applications Serial No. 09/633,615, Serial No. 09/633,616, and Serial No.
09/633,850, all filed August 7, 2000, which are also continuations-in-part of
application Serial No. 09/578,011. This application is also related to
applications Serial No. 09/303,386 and Serial No. 09/305,345, now issued as
Patent No. 6,157,921, both filed May 1, 1999, and to application Serial No.
09/715,832, filed November 14, 2000, all of which also claim priority to
provisional application Serial No. 60/083,961. Each of the above-identified
applications is incorporated herein by reference.

FIELD OF THE INVENTION 
The present invention relates to the use of learning machines to identify
relevant patterns in datasets containing large quantities of diverse data, and
more particularly to a computational platform for extraction of data from
multiple, diverse sources for identification of relevant patterns in biological
data.

BACKGROUND OF THE INVENTION 

Currently, most innovations in diagnosis and in therapy remain within the
framework of morphology (e.g., the study of tumor shapes), physiology (the study
of organ function), and chemistry.

With the advent of molecular biology and molecular genetics, medicine
and pharmacology have entered the information age. Information technology,
which has been so widely applied to the understanding of human intelligence
(artificial intelligence, neural networks), telecommunications, and the
Internet, should be applicable to the study of the program of life.

Disease used to be understood as the intrusion of foreign agents (e.g.,
bacteria) that should be deleted, or as a chemical imbalance that should be
compensated. In the genomic era, diseases are interpreted as a deficiency of the
genetic program to adapt to its environment caused by missing, lost, exaggerated
or corrupted genetic information. We are moving towards an age when disease
and disease susceptibility will be described and remedied not only in terms of
their symptoms (phenotype), but in terms of their cause: external agents and
genetic malfunction (genotype).

A great deal of effort of the pharmaceutical industry is presently being
directed toward detecting the genetic malfunction (diagnosis) and correcting it
(cure), using the tools of modern genomics and biotechnology. Correcting a
genetic malfunction can occur at the DNA level using gene therapy. The
replacement of destroyed tissues due to, e.g., arthrosis, heart disease, or
neuro-degeneration, could be achieved by activating natural regeneration
processes, following a mechanism similar to that of embryonic development.

Most genes, when activated, yield the production of one or several
specific proteins. Acting on proteins is projected to be the domain of modern
drug therapy. There are two complementary ways of acting on proteins: (1) the
concentration of proteins soluble in serum can be modified by using them
directly as drugs; (2) chemical compounds that interact selectively with given
proteins can be used as drugs.

It has been estimated that between 10,000 and 15,000 human genes code
for soluble proteins. If only a small percentage of these proteins have a
therapeutic effect, a considerable number of new medicinal substances based on
proteins remain to be found. Presently, approximately 100 proteins are used as
medicines.

02/103954 PCT/US02/19202

All of today's drugs that are known to be safe and effective are directed at
approximately 500 target molecules. Most drug targets are either enzymes (22%)
or receptors (52%). Enzymes are proteins responsible for activating certain
chemical reactions (catalysts). Enzyme inhibitors can, for example, halt cell
reproduction for purposes of fighting bacterial infection. The inhibition of
enzymes is one of the most successful strategies for finding new medicines, one
example of which is the use of reverse transcriptase inhibitors to fight
infection by the HIV retrovirus. Receptors can be defined as proteins that
form stable bonds with ligands such as hormones or neurotransmitters. Receptors
can serve as "docking stations" for toxic substances to selectively poison
parasites or tumor cells (chemotherapy). In the pharmacological definition,
receptors are stimuli or signal transceivers. Blocking a receptor such as a
neurotransmitter receptor, a hormone receptor or an ion channel alters the
functioning of the cell. Since the 1950's, many successful drugs which function
as receptor blockers have been introduced, including psychopharmaceuticals,
beta-blockers, calcium antagonists, diuretics, new anesthetics, and
anti-inflammatory preparations.

It can be estimated that about one thousand genes are involved in common
diseases. The proteins associated with these genes may not all be good drug
targets, but among the dozens of proteins that participate in the regulatory
pathway, one can assume that at least three to five represent good drug targets.
According to this estimate, 3,000 to 5,000 proteins could become the targets of
new medicines, which is an order of magnitude greater than what is known today.

With a typical drug development process costing about $300-500 million
per drug, providing a better ranking of potential leads is of the utmost
importance. With the recent completion of the first draft of the human genome
that revealed its 30,000 genes, and with the new microarray and combinatorial
chemistry technologies, the quantity and variety of genomics data are growing at
a significantly more rapid pace than the informatics capacity to analyze them.

The emphasis of molecular biology is shifting from a hypothesis-driven
model to a data-driven model. Previously, years of intense laboratory research
were required to collect data and test hypotheses regarding a single system or
pathway and to study the effect of one particular drug. The new data-intensive
paradigm relies on a combination of proprietary data and data gathered and
shared worldwide on tens of thousands of simultaneous miniaturized
experiments. Bioinformatics is playing a crucial role in managing and analyzing
this data.

While drug development will still follow its traditional path of animal
experimentation and clinical trials for the most promising leads, it is expected
that the acquisition of data from arraying technology and combinatorial
chemistry followed by proper data analysis will considerably accelerate drug
discovery and cut down the development cost.

Additionally, completely new areas will develop such as personalized
medicine. As is known, a mix of genetic and environmental factors causes
diseases. Understanding the relationships between such factors promises to
improve disease prevention considerably and to yield significant health care
cost savings. With genomic diagnosis, it will also be possible to prescribe a
well-targeted drug, adjust the dosage and monitor treatment.

Following the challenge of genome sequencing, it is generally recognized
that the two most important bioinformatics challenges are microarray data
analysis (with the analysis of tens of thousands of variables) and the
construction of decision systems that integrate data analysis from different
sources. The essence of the problem of designing a good cost-effective
diagnostic test or determining good drug targets is to establish a ranking among
candidate genes or proteins, the most promising ones coming at the top of the
list. To be truly effective, such a ranked list must incorporate knowledge from
a great variety of sources, including genomic DNA information, gene expression,
protein concentration, and pharmacological and toxicological data. Challenges
include: analyzing data sets with few samples but very large numbers of inputs
(thousands of gene expression coefficients from only 10-20 patients); using data
of poor quality or incomplete data; combining heterogeneous data sets;
visualizing results; incorporating the assistance of human experts; complying
with rules and checks for safety requirements; satisfying economic constraints
(e.g., selecting only one or two best leads to be pursued); in the case of an
aid to decision makers, providing justifications of the system's
recommendations; and in the case of personalized medicine, making the
information easily accessible to the public.

Thus, the need exists for a system capable of analyzing combined data
from a number of sources of varying quantity, quality and origin in order to
produce useful information.

SUMMARY OF THE INVENTION

In an exemplary embodiment, the data mining platform of the present
invention comprises a plurality of system modules, each formed from a plurality
of components. Each module comprises an input data component, a data analysis
engine for processing the input data, an output data component for outputting
the results of the data analysis, and a web server to access and monitor the
other modules within the unit and to provide communication to other units. Each
module processes a different type of data; for example, a first module processes
microarray (gene expression) data while a second module processes biomedical
literature on the Internet for information supporting relationships between
genes and diseases and gene functionality. In the preferred embodiment, the data
analysis engine is a kernel-based learning machine, and in particular, one or
more support vector machines (SVMs). The data analysis engine includes a
pre-processing function for feature selection, for reducing the amount of data
to be processed by selecting the optimum number of attributes, or "features",
relevant to the information to be discovered. In the preferred embodiment, the
feature selection means is recursive feature elimination (RFE), such that the
preferred embodiment of the data analysis engine uses RFE-SVM. The output of the
data analysis engine of one module may be input into the data analysis engine of
a different module. Thus, the output data from one module is treated as input
data which would be subject to feature ranking and/or selection so that the most
relevant features for a given analysis are taken from different data sources.
Alternatively, the outputs of two or more modules may be input into an
independent data analysis engine so that the knowledge is progressively
distilled. For example, analysis results of microarray data can be validated by
comparison against documents retrieved in an on-line literature search, or the
results of the different modules can be otherwise combined into a single result
or format.

In the preferred embodiment of the data analysis engine, pre-processing
can include identifying missing or erroneous data points, or outliers, and
taking appropriate steps to correct the flawed data or, as appropriate, remove
the observation or the entire field from the scope of the problem. Such
pre-processing can be referred to as "data cleaning". Pre-processing can also
include clustering of data, which provides means for feature selection by
substituting the cluster center for the features within that cluster, thus
reducing the quantity of features to be processed. The features remaining after
pre-processing are then used to train a learning machine for purposes of pattern
classification, regression, clustering and/or novelty detection.
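
The data-cleaning step described above can be sketched in a few lines of
Python. This is a minimal illustration rather than the platform's
implementation: the function name `clean_observations`, the use of `None` to
mark a missing data point, and the 50% threshold for removing an entire field
are all assumptions made for the example.

```python
def clean_observations(rows, max_missing_fraction=0.5):
    """Drop mostly-missing fields, then drop observations that still have gaps.

    rows: list of equal-length lists; None marks a missing data point
    (an assumed convention for this sketch).
    Returns (cleaned_rows, kept_field_indices).
    """
    n_rows = len(rows)
    n_fields = len(rows[0])
    # Remove an entire field when too many of its values are missing.
    kept = [
        j for j in range(n_fields)
        if sum(row[j] is None for row in rows) / n_rows <= max_missing_fraction
    ]
    reduced = [[row[j] for j in kept] for row in rows]
    # Remove any observation that is still incomplete.
    cleaned = [row for row in reduced if all(v is not None for v in row)]
    return cleaned, kept


raw = [
    [1.0, None, 3.0],
    [2.0, None, None],
    [4.0, 5.0, 6.0],
]
cleaned, kept = clean_observations(raw)
```

Here the second field is removed because two of its three values are missing,
and the second observation is removed because it remains incomplete. Correcting
flawed values instead of removing them would be an equally valid choice under
the description above.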

A test data set is pre-processed in the same manner as was the training
data set. Then, the trained learning machine is tested using the pre-processed
test data set. A test output of the trained learning machine may be
post-processed to determine if the test output is an optimal solution based on
the known outcome of the test data set.

In the context of a kernel-based learning machine such as a support
vector machine, the present invention also provides for the selection of at
least one kernel prior to training the support vector machine. The selection of
a kernel may be based on prior knowledge of the specific problem being addressed
or analysis of the properties of any available data to be used with the learning
machine and is typically dependent on the nature of the knowledge to be
discovered from the data.

Kernels are usually defined for patterns that can be represented as a vector
of real numbers. For example, linear kernels, radial basis function kernels and
polynomial kernels all measure the similarity of a pair of real vectors. Such
kernels are appropriate when the patterns are best represented as a sequence of
real numbers.
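
As an illustration of how such kernels score the similarity of two real
vectors, the following Python sketch uses the common textbook forms of the three
kernels named above. The parameter names and default values (`degree`, `coef0`,
`gamma`) are assumptions for the example and are not taken from this
specification.

```python
import math


def linear_kernel(x, z):
    """Inner product of the two vectors."""
    return sum(a * b for a, b in zip(x, z))


def polynomial_kernel(x, z, degree=2, coef0=1.0):
    """Shifted inner product raised to a power."""
    return (linear_kernel(x, z) + coef0) ** degree


def rbf_kernel(x, z, gamma=1.0):
    """Radial basis function: decays with the squared distance between x and z."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


x = [1.0, 2.0]
z = [3.0, 4.0]
similarities = {
    "linear": linear_kernel(x, z),
    "polynomial": polynomial_kernel(x, z),
    "rbf": rbf_kernel(x, z),
}
```

Each function takes a pair of equal-length lists of real numbers and returns a
single similarity value, which is the only interface a kernel-based learning
machine needs from the kernel.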

An iterative process comparing post-processed training outputs or test
outputs can be applied to make a determination as to which kernel configuration
provides the optimal solution. If the test output is not the optimal solution,
the selection of the kernel may be adjusted and the support vector machine may
be retrained and retested. Once it is determined that the optimal solution has
been identified, a live data set may be collected and pre-processed in the same
manner as was the training data set to select the features that best represent
the data. The pre-processed live data set is input into the learning machine for
processing. The live output of the learning machine may then be post-processed
by interpreting the live output into a computationally derived alphanumeric
classifier or other form suitable to further utilization of the analysis
results.

The data mining platform of the present invention provides a tailored
analysis for application to novel data sources. In the preferred embodiment,
support vector machines are integrated at multiple levels, e.g., at each module
and for processing of the combined results of two or more modules.

BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a flowchart illustrating an exemplary general method for
increasing knowledge that may be discovered from data using a learning machine.

FIG. 2 is a flowchart illustrating an exemplary method for increasing
knowledge that may be discovered from data using a support vector machine.

FIG. 3 is a flowchart illustrating an exemplary optimal categorization
method that may be used in a stand-alone configuration or in conjunction with a
learning machine for pre-processing or post-processing techniques in accordance
with an exemplary embodiment of the present invention.

FIG. 4 is a functional block diagram illustrating an exemplary operating
environment for an embodiment of the present invention.

FIG. 5 is a functional block diagram illustrating a hierarchical system of
multiple support vector machines.

FIG. 6 is a block diagram of the generic architecture of a module of the
data mining platform.

FIG. 7 is a block diagram of an exemplary embodiment of the data mining
platform having two modules for processing, then combining, two different kinds
of input data.

FIG. 8 is a block diagram of the global architecture of a module of the
data mining platform.

FIG. 9 is an exemplary screen shot of an interface for the Gene Search
Assistant application for bioinformatics for use in searching published
information.

FIG. 10 is an exemplary screen shot of an interface for the Gene Search
Assistant application for bioinformatics for displaying results of a search.

FIG. 11a is a color map for visualizing gene ranking results; FIG. 11b is a
display of nested subsets of features using the color map of FIG. 11a to assist
in visualization of the ranking.

FIG. 12 is a bar diagram showing a nested subset of features for visualizing
gene ranking results.

FIG. 13 illustrates an exemplary screen shot of an interface generated by
the "Gene Tree Explorer" program implemented according to the present
invention for analysis of gene expression data.

FIG. 14 is an exemplary gene observation graph, or "gene tree".

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

As used herein, "biological data" means any data derived from measuring
biological conditions of humans, animals or other biological organisms including
microorganisms, viruses, plants and other living organisms. The measurements
may be made by any tests, assays or observations that are known to physicians,
scientists, diagnosticians, or the like. Biological data may include, but is not
limited to, clinical tests and observations, physical and chemical measurements,
genomic determinations, proteomic determinations, drug levels, hormonal and
immunological tests, neurochemical or neurophysical measurements, mineral and
vitamin level determinations, genetic and familial histories, and other
determinations that may give insight into the state of the individual or
individuals that are undergoing testing. The term "data" is used interchangeably
with "biological data".

While several examples of learning machines exist and advancements are
expected in this field, the exemplary embodiments of the present invention focus
on kernel-based learning machines and more particularly on the support vector
machine.

The present invention can be used to analyze biological data generated at
multiple stages of investigation into biological functions, and further, to
integrate the different kinds of data for novel diagnostic and prognostic
determinations. For example, biological data obtained from clinical case
information, such as diagnostic test data, family or genetic histories, prior or
current medical treatments and the clinical outcomes of such activities, and
published medical literature, can be utilized in the method and system of the
present invention. Additionally, clinical samples such as diseased tissues or
fluids, and normal tissues and fluids, and cell separations can provide
biological data that can be utilized by the current invention. Proteomic
determinations such as 2-D gel, mass spectrophotometry and antibody screening
can be used to establish databases that can be utilized by the present
invention. Genomic databases can also be used alone or in combination with the
above-described data and databases by the present invention to provide
comprehensive diagnosis, prognosis or predictive capabilities to the user of the
present invention.

A first aspect of the present invention facilitates analysis of data by
pre-processing the data prior to using the data to train a learning machine
and/or optionally post-processing the output from a learning machine. Generally
stated, pre-processing data comprises reformatting or augmenting the data in
order to allow the learning machine to be applied most advantageously. More
specifically, pre-processing involves selecting a method for reducing the
dimensionality of the feature space, i.e., selecting the features which best
represent the data. In the preferred embodiment, recursive feature elimination
(RFE) is used; however, other methods may be used to select an optimal subset of
features, such as those disclosed in co-pending PCT application Serial No.
PCT/US02/16012, filed in the U.S. Receiving Office on May 20, 2002, entitled
"Methods for Feature Selection in a Learning Machine", which is incorporated
herein by reference. The features remaining after feature selection are then
used to train a learning machine for purposes of pattern classification,
regression, clustering and/or novelty detection.
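
The idea behind recursive feature elimination can be sketched as follows. This
is a simplified illustration under stated assumptions, not the implementation of
the preferred embodiment: it assumes the ranking criterion is the squared weight
of a linear SVM (a common choice that this passage does not spell out), it
removes one feature per iteration, and it trains the SVM with a basic
subgradient method. All function names and parameter values are hypothetical.

```python
def train_linear_svm(X, y, epochs=100, lr=0.1, lam=0.01):
    """Fit weights w and bias b by subgradient descent on the
    L2-regularized hinge loss (a simple stand-in for an SVM solver)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * score < 1:
                # Margin violated: step along the hinge-loss subgradient.
                w = [wj + lr * (yi * xj - lam * wj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:
                # Margin satisfied: only the regularizer shrinks the weights.
                w = [wj - lr * lam * wj for wj in w]
    return w, b


def recursive_feature_elimination(X, y, n_keep=1):
    """Return (kept_features, eliminated_features_in_order_of_removal)."""
    remaining = list(range(len(X[0])))
    eliminated = []
    while len(remaining) > n_keep:
        X_sub = [[row[j] for j in remaining] for row in X]
        w, _ = train_linear_svm(X_sub, y)
        # Remove the feature whose squared weight is smallest.
        worst = min(range(len(remaining)), key=lambda k: w[k] ** 2)
        eliminated.append(remaining.pop(worst))
    return remaining, eliminated


# Toy data: feature 0 tracks the label, feature 1 is constant,
# feature 2 has small amplitude.
X = [
    [2.0, 0.0, 0.1],
    [1.5, 0.0, -0.1],
    [2.5, 0.0, 0.1],
    [-2.0, 0.0, -0.1],
    [-1.5, 0.0, 0.1],
    [-2.5, 0.0, -0.1],
]
y = [1, 1, 1, -1, -1, -1]
kept, eliminated = recursive_feature_elimination(X, y)
```

The order in which features are eliminated gives a ranking: features removed
last are the ones the classifier relied on most. On data with thousands of
features, removing several features per iteration would be a natural variation.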

In a manner similar to pre-processing, post-processing involves
interpreting the output of a learning machine in order to discover meaningful
characteristics thereof. The meaningful characteristics to be ascertained from
the output may be problem- or data-specific. Post-processing involves
interpreting the output into a form that, for example, may be understood by or
is otherwise useful to a human observer, or converting the output into a form
which may be readily received by another device for, e.g., archival or
transmission.

FIG. 1 is a flowchart illustrating a general method 100 for analyzing data
using learning machines. The method 100 begins at starting block 101 and
progresses to step 102 where a specific problem is formalized for application of
analysis through machine learning. Particularly important is a proper
formulation of the desired output of the learning machine. For instance, in
predicting future performance of an individual equity instrument, or a market
index, a learning machine is likely to achieve better performance when
predicting the expected future change rather than predicting the future price
level. The future price expectation can later be derived in a post-processing
step as will be discussed later in this specification.
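
The equity example can be made concrete with a short sketch. The helper names
and the price values below are hypothetical; the point is only that the learning
machine is given price changes as its target, and the price level is recovered
afterwards by a post-processing step.

```python
def to_changes(prices):
    """Pre-processing: convert a price series into period-to-period changes."""
    return [later - earlier for earlier, later in zip(prices, prices[1:])]


def to_price_level(last_known_price, predicted_change):
    """Post-processing: turn a predicted change back into an expected price."""
    return last_known_price + predicted_change


prices = [100.0, 102.0, 101.0]
targets = to_changes(prices)           # what the learning machine would predict
predicted_change = 1.5                 # stand-in for a learning machine output
expected_price = to_price_level(prices[-1], predicted_change)
```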
After problem formalization, step 103 addresses training data collection.
Training data comprises a set of data points having known characteristics. This
data may come from customers, research facilities, academic institutions,
national laboratories, commercial entities or other public or confidential
sources. The source of the data and the types of data provided are not crucial
to the methods. Training data may be collected from one or more local and/or
remote sources. The data may be provided through any means such as via the
internet, server linkages or discs, CD/ROMs, DVDs or other storage means. The
collection of training data may be accomplished manually or by way of an
automated process, such as known electronic data transfer methods. Accordingly,
an exemplary embodiment of the learning machine for use in conjunction with the
present invention may be implemented in a networked computer environment.
Exemplary operating environments for implementing various embodiments of the
learning machine will be described in detail with respect to FIGS. 4-5.

At step 104, the collected training data is optionally pre-processed in
order to allow the learning machine to be applied most advantageously toward
extraction of the knowledge inherent to the training data. During this
preprocessing stage a variety of different transformations can be performed on
the data to enhance its usefulness. Such transformations, examples of which
include addition of expert information, labeling, binary conversion, Fourier
transformations, etc., will be readily apparent to those of skill in the art.
However, the preprocessing of interest in the present invention is the reduction
of dimensionality by way of feature selection, different methods of which are
described in detail below.

Returning to FIG. 1, an exemplary method 100 continues at step 106,
where the learning machine is trained using the pre-processed data. As is known
in the art, a learning machine is trained by adjusting its operating parameters
until a desirable training output is achieved. The determination of whether a
training output is desirable may be accomplished either manually or
automatically by comparing the training output to the known characteristics of
the training data. A learning machine is considered to be trained when its
training output is within a predetermined error threshold from the known
characteristics of the training data. In certain situations, it may be
desirable, if not necessary, to post-process the training output of the learning
machine at step 107. As mentioned, post-processing the output of a learning
machine involves interpreting the output into a meaningful form. In the context
of a regression problem, for example, it may be necessary to determine range
categorizations for the output of a learning machine in order to determine if
the input data points were correctly categorized. In the example of a pattern
recognition problem, it is often not necessary to post-process the training
output of a learning machine.

At step 108, test data is optionally collected in preparation for testing the
trained learning machine. Test data may be collected from one or more local
and/or remote sources. In practice, test data and training data may be collected
from the same source(s) at the same time. Thus, test data and training data sets
can be divided out of a common data set and stored in a local storage medium for
use as different input data sets for a learning machine. Regardless of how the
test data is collected, any test data used must be pre-processed at step 110 in
the same manner as was the training data. As should be apparent to those skilled
in the art, a proper test of the learning may only be accomplished by using
testing data of the same format as the training data. Then, at step 112 the
learning machine is tested using the pre-processed test data, if any. The test
output of the learning machine is optionally post-processed at step 114 in order
to determine if the results are desirable. Again, the post-processing step
involves interpreting the test output into a meaningful form. The meaningful
form may be one that is readily understood by a human or one that is compatible
with another processor. Regardless, the test output must be post-processed into
a form which may be compared to the test data to determine whether the results
were desirable. Examples of post-processing steps include but are not limited to
the following: optimal categorization determinations, scaling techniques (linear
and non-linear), transformations (linear and non-linear), and probability
estimations. The method 100 ends at step 116.
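
The requirement that test data be pre-processed "in the same manner" as the
training data can be illustrated with a scaling transformation. This is a
sketch under assumptions: standardization is used here only as a convenient
example of a pre-processing step, and the function names are hypothetical. The
key point is that the transformation parameters are derived from the training
data and then reused unchanged on the test data.

```python
import math


def fit_scaler(train_rows):
    """Learn per-feature mean and standard deviation from the training set only."""
    n = len(train_rows)
    n_features = len(train_rows[0])
    means = [sum(row[j] for row in train_rows) / n for j in range(n_features)]
    stds = [
        # Fall back to 1.0 for a constant feature to avoid dividing by zero.
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in train_rows) / n) or 1.0
        for j in range(n_features)
    ]
    return means, stds


def apply_scaler(rows, means, stds):
    """Apply a previously fitted transformation to any data set."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


train_rows = [[1.0, 10.0], [3.0, 30.0]]
test_rows = [[2.0, 40.0]]

means, stds = fit_scaler(train_rows)                  # training data only
train_scaled = apply_scaler(train_rows, means, stds)
test_scaled = apply_scaler(test_rows, means, stds)    # same parameters reused
```

Fitting the scaler separately on the test set would put the two data sets in
different formats, which is exactly what the paragraph above warns against.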

FIG. 2 is a flow chart illustrating an exemplary method 200 for enhancing
knowledge that may be discovered from data using a specific type of learning
machine known as a support vector machine (SVM). A SVM implements a
specialized algorithm for providing generalization when estimating a
multi-dimensional function from a limited collection of data. A SVM may be
particularly useful in solving dependency estimation problems. More
specifically, a SVM may be used accurately in estimating indicator functions
(e.g. pattern recognition problems) and real-valued functions (e.g. function
approximation problems, regression estimation problems, density estimation
problems, and solving inverse problems). The SVM was originally developed by
Vladimir N. Vapnik. The concepts underlying the SVM are explained in detail in
his book, entitled Statistical Learning Theory (John Wiley & Sons, Inc. 1998),
which is herein incorporated by reference in its entirety. Accordingly, a
familiarity with SVMs and the terminology used therewith is presumed
throughout this specification.

The exemplary method 200 begins at starting block 201 and advances to
step 202, where a problem is formulated and then to step 203, where a training
data set is collected. As was described with reference to FIG. 1, training data
may be collected from one or more local and/or remote sources, through a manual
or automated process. At step 204 the training data is optionally pre-processed.
Those skilled in the art should appreciate that SVMs are capable of processing
input data having extremely large dimensionality; however, according to the
present invention, pre-processing includes the use of feature selection methods
to reduce the dimensionality of feature space.
At step 206 a kernel is selected for the SVM. As is known in the art,
different kernels will cause a SVM to produce varying degrees of quality in the
output for a given set of input data. Therefore, the selection of an appropriate
kernel may be essential to the desired quality of the output of the SVM. In one
embodiment of the learning machine, a kernel may be chosen based on prior
performance knowledge. As is known in the art, exemplary kernels include
polynomial kernels, radial basis classifier kernels, linear kernels, etc. In an
alternate embodiment, a customized kernel may be created that is specific to a
particular problem or type of data set. In yet another embodiment, multiple
SVMs may be trained and tested simultaneously, each using a different kernel.
The quality of the outputs for each simultaneously trained and tested SVM may
be compared using a variety of selectable or weighted metrics (see step 222) to
determine the most desirable kernel.

Next, at step 208 the pre-processed training data is input into the SVM.
At step 210, the SVM is trained using the pre-processed training data to
generate an optimal hyperplane. Optionally, the training output of the SVM may
then be post-processed at step 211. Again, post-processing of training output
may be desirable, or even necessary, at this point in order to properly
calculate ranges or categories for the output. At step 212 test data is
collected similarly to previous descriptions of data collection. The test data
is pre-processed at step 214 in the same manner as was the training data above.
Then, at step 216 the pre-processed test data is input into the SVM for
processing in order to determine whether the SVM was trained in a desirable
manner. The test output is received from the SVM at step 218 and is optionally
post-processed at step 220.

Based on the post-processed test output, it is determined at step 222
whether an optimal minimum was achieved by the SVM. Those skilled in the art
should appreciate that a SVM is operable to ascertain an output having a global
minimum error. However, as mentioned above, output results of a SVM for a
given data set will typically vary with kernel selection. Therefore, there are
in fact multiple global minimums that may be ascertained by a SVM for a given
set of data. As used herein, the term "optimal minimum" or "optimal solution"
refers to a selected global minimum that is considered to be optimal (e.g. the
optimal solution for a given set of problem specific, pre-established criteria)
when compared to other global minimums ascertained by a SVM. Accordingly,
at step 222, determining whether the optimal minimum has been ascertained may
involve comparing the output of a SVM with a historical or predetermined value.
Such a predetermined value may be dependent on the test data set. For example,
in the context of a pattern recognition problem where data points are classified
by a SVM as either having a certain characteristic or not having the
characteristic, a global minimum error of 50% would not be optimal. In this
example, a global minimum of 50% is no better than the result that would be
achieved by flipping a coin to determine whether the data point had that
characteristic. As another example, in the case where multiple SVMs are trained
and tested simultaneously with varying kernels, the outputs for each SVM may be
compared with the outputs of the other SVMs to determine the practical optimal
solution for that particular set of kernels. The determination of whether an
optimal solution has been ascertained may be performed manually or through an
automated comparison process.

If it is determined that the optimal minimum has not been achieved by the
trained SVM, the method advances to step 224, where the kernel selection is
adjusted. Adjustment of the kernel selection may comprise selecting one or more
new kernels or adjusting kernel parameters. Furthermore, in the case where
multiple SVMs were trained and tested simultaneously, selected kernels may be
replaced or modified while other kernels may be re-used for control purposes.
After the kernel selection is adjusted, the method 200 is repeated from step 208,
where the pre-processed training data is input into the SVM for training purposes.
When it is determined at step 222 that the optimal minimum has been achieved,
the method advances to step 226, where live data is collected similarly as
described above. By definition, live data has not been previously evaluated, so
that the desired output characteristics that were known with respect to the training
data and the test data are not known.

At step 228 the live data is pre-processed in the same manner as was the
training data and the test data. At step 230, the live pre-processed data is input
into the SVM for processing. The live output of the SVM is received at step 232
and is post-processed at step 234. The method 200 ends at step 236.
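
By way of illustration only, the kernel-selection loop of steps 208 through 224
may be sketched in Python as follows. This is a minimal sketch under stated
assumptions, not the implementation of the invention: a kernel perceptron
stands in for SVM training so that no external library is required, and the
names `train`, `error_rate` and `select_kernel`, the two candidate kernels, the
value of `gamma` and the 10% acceptance threshold are all hypothetical.

```python
import math

# Hypothetical candidate kernels; gamma is an assumed parameter value.
def linear_kernel(x, z):
    return sum(a * b for a, b in zip(x, z))

def rbf_kernel(x, z, gamma=2.0):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def decision(kernel, alpha, xs, ys, x):
    """Kernel expansion f(x) = sum_i alpha_i * y_i * K(x_i, x)."""
    return sum(a * yi * kernel(xi, x) for a, xi, yi in zip(alpha, xs, ys))

def train(kernel, xs, ys, epochs=20):
    """Kernel perceptron, used here only as a stand-in for step 210."""
    alpha = [0] * len(xs)
    for _ in range(epochs):
        mistakes = 0
        for i, (x, y) in enumerate(zip(xs, ys)):
            if y * decision(kernel, alpha, xs, ys, x) <= 0:
                alpha[i] += 1
                mistakes += 1
        if mistakes == 0:
            break
    return alpha

def error_rate(kernel, alpha, train_x, train_y, eval_x, eval_y):
    """Fraction of held-out points that are misclassified (steps 216-220)."""
    wrong = 0
    for x, y in zip(eval_x, eval_y):
        label = 1 if decision(kernel, alpha, train_x, train_y, x) > 0 else -1
        wrong += label != y
    return wrong / len(eval_x)

def select_kernel(kernels, train_x, train_y, eval_x, eval_y, max_error=0.1):
    """Train one learner per kernel and compare their test errors (step 222)."""
    errors = {}
    for name, kernel in kernels.items():
        alpha = train(kernel, train_x, train_y)
        errors[name] = error_rate(kernel, alpha, train_x, train_y, eval_x, eval_y)
    best = min(errors, key=errors.get)
    # If not accepted, the caller would adjust the kernel selection (step 224).
    return best, errors[best] <= max_error, errors

# An XOR-like problem: not separable with the linear kernel.
train_x = [(0, 0), (0, 1), (1, 0), (1, 1)]
train_y = [-1, 1, 1, -1]
eval_x = [(0.1, 0.1), (0.1, 0.9), (0.9, 0.1), (0.9, 0.9)]
eval_y = [-1, 1, 1, -1]
kernels = {"linear": linear_kernel, "rbf": rbf_kernel}
best, accepted, errors = select_kernel(kernels, train_x, train_y, eval_x, eval_y)
print(best, accepted, errors)
```

In this toy problem the radial basis function kernel is retained, because no
linear decision function through the origin can classify all four test points
correctly.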

FIG. 3 is a flow chart illustrating an exemplary optimal categorization
method 300 that may be used for pre-processing data or post-processing output
from a learning machine. Additionally, as will be described below, the
exemplary optimal categorization method may be used as a stand-alone
categorization technique, independent from learning machines. The exemplary
optimal categorization method 300 begins at starting block 301 and progresses to
step 302, where an input data set is received. The input data set comprises a
sequence of data samples from a continuous variable. The data samples fall
within two or more classification categories. Next, at step 304 the bin and
class-tracking variables are initialized. As is known in the art, bin variables
relate to resolution, while class-tracking variables relate to the number of
classifications within the data set. Determining the values for initialization of
the bin and class-tracking variables may be performed manually or through an
automated process, such as a computer program for analyzing the input data set.
At step 306, the data entropy for each bin is calculated. Entropy is a
mathematical quantity that measures the uncertainty of a random distribution. In
the exemplary method 300, entropy is used to gauge the gradations of the input
variable so that maximum classification capability is achieved.

The method 300 produces a series of "cuts" on the continuous variable,
such that the continuous variable may be divided into discrete categories. The
cuts selected by the exemplary method 300 are optimal in the sense that the
average entropy of each resulting discrete category is minimized. At step 308, a
determination is made as to whether all cuts have been placed within the input
data set comprising the continuous variable. If all cuts have not been placed,
sequential bin combinations are tested for cutoff determination at step 310. From
step 310, the exemplary method 300 loops back through step 306 and returns to
step 308, where it is again determined whether all cuts have been placed within
the input data set comprising the continuous variable. When all cuts have been
placed, the entropy for the entire system is evaluated at step 309 and compared to
previous results from testing more or fewer cuts. If it cannot be concluded that a
minimum entropy state has been determined, then other possible cut selections
must be evaluated and the method proceeds to step 311. From step 311 a
heretofore untested selection for number of cuts is chosen and the above process
is repeated from step 304. When either the limits of the resolution determined by
the bin width have been tested or the convergence to a minimum solution has been
identified, the optimal classification criteria are output at step 312 and the
exemplary optimal categorization method 300 ends at step 314.

The optimal categorization method 300 takes advantage of dynamic
programming techniques. As is known in the art, dynamic programming
techniques may be used to significantly improve the efficiency of solving certain
complex problems through carefully structuring an algorithm to reduce redundant
calculations. In the optimal categorization problem, the straightforward approach
of exhaustively searching through all possible cuts in the continuous variable data
would result in an algorithm of exponential complexity and would render the
problem intractable for even moderate sized inputs. By taking advantage of the
additive property of the target function, in this problem the average entropy, the
problem may be divided into a series of sub-problems. By properly formulating
algorithmic sub-structures for solving each sub-problem and storing the solutions
of the sub-problems, a significant amount of redundant computation may be
identified and avoided. As a result of using the dynamic programming approach,
the exemplary optimal categorization method 300 may be implemented as an
algorithm having a polynomial complexity, which may be used to solve large
sized problems.
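
By way of illustration only, the dynamic programming formulation may be
sketched as follows. The sketch makes several assumptions that are not taken
from the description above: the target function is the sum over categories of
the category size times its class entropy (one additive form of the average
entropy), the number of categories `n_bins` is supplied by the caller rather
than searched as in steps 309 and 311, each cut is reported at the midpoint
between adjacent sorted samples, and samples with equal values receive no
special handling. The function names are hypothetical.

```python
import math
from collections import Counter

def segment_cost(labels):
    """Class entropy (in bits) of a segment, weighted by the segment size."""
    n = len(labels)
    counts = Counter(labels)
    entropy = sum((c / n) * math.log2(n / c) for c in counts.values())
    return n * entropy

def optimal_cuts(values, labels, n_bins):
    """Return (cuts, average entropy) for the best split into n_bins ranges."""
    pairs = sorted(zip(values, labels))
    xs = [v for v, _ in pairs]
    ys = [c for _, c in pairs]
    n = len(xs)
    # cost[i][j] is the weighted entropy of the sorted samples i .. j-1.
    cost = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(i + 1, n + 1):
            cost[i][j] = segment_cost(ys[i:j])
    inf = float("inf")
    # best[b][j]: minimum total cost of splitting the first j samples into b bins.
    best = [[inf] * (n + 1) for _ in range(n_bins + 1)]
    back = [[0] * (n + 1) for _ in range(n_bins + 1)]
    best[0][0] = 0.0
    for b in range(1, n_bins + 1):
        for j in range(b, n + 1):
            for i in range(b - 1, j):
                candidate = best[b - 1][i] + cost[i][j]  # stored sub-solution reused
                if candidate < best[b][j]:
                    best[b][j] = candidate
                    back[b][j] = i
    # Walk the back-pointers to recover the cut positions.
    cuts = []
    j = n
    for b in range(n_bins, 1, -1):
        i = back[b][j]
        cuts.append((xs[i - 1] + xs[i]) / 2)
        j = i
    return sorted(cuts), best[n_bins][n] / n

cuts, average_entropy = optimal_cuts([1, 2, 3, 10, 11, 12], list("aaabbb"), 2)
print(cuts, average_entropy)
```

Because each sub-solution `best[b][j]` is computed once and reused, the search
over cut positions takes a number of steps that is polynomial in the number of
samples instead of enumerating every combination of cuts.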

As mentioned above, the exemplary optimal categorization method 300
may be used in pre-processing data and/or post-processing the output of a
learning machine. For example, as a pre-processing transformation step, the
exemplary optimal categorization method 300 may be used to extract
classification information from raw data. As a post-processing technique, the
exemplary optimal range categorization method may be used to determine the
optimal cut-off values for markers objectively based on data, rather than relying
on ad hoc approaches. As should be apparent, the exemplary optimal
categorization method 300 has applications in pattern recognition, classification,
regression problems, etc. The exemplary optimal categorization method 300 may
also be used as a stand-alone categorization technique, independent from SVMs
and other learning machines.

FIG. 4 and the following discussion are intended to provide a brief and
general description of a suitable computing environment for implementing
biological data analysis according to the present invention. Although the system
shown in FIG. 4 is a conventional personal computer 1000, those skilled in the art
will recognize that the invention also may be implemented using other types of
computer system configurations. The computer 1000 includes a central
processing unit 1022, a system memory 1020, and an Input/Output ("I/O") bus
1026. A system bus 1021 couples the central processing unit 1022 to the system
memory 1020. A bus controller 1023 controls the flow of data on the I/O bus
1026 and between the central processing unit 1022 and a variety of internal and
external I/O devices. The I/O devices connected to the I/O bus 1026 may have
direct access to the system memory 1020 using a Direct Memory Access
("DMA") controller 1024.

The I/O devices are connected to the I/O bus 1026 via a set of device
interfaces. The device interfaces may include both hardware components and
software components. For instance, a hard disk drive 1030 and a floppy disk
drive 1032 for reading or writing removable media 1050 may be connected to the
I/O bus 1026 through disk drive controllers 1040. An optical disk drive 1034 for
reading or writing optical media 1052 may be connected to the I/O bus 1026
using a Small Computer System Interface ("SCSI") 1041. Alternatively, an IDE
(Integrated Drive Electronics, i.e., a hard disk drive interface for PCs), ATAPI
(ATtAchment Packet Interface, i.e., CD-ROM and tape drive interface), or EIDE
(Enhanced IDE) interface may be associated with an optical drive such as may be
the case with a CD-ROM drive. The drives and their associated computer-readable
media provide nonvolatile storage for the computer 1000. In addition to
the computer-readable media described above, other types of computer-readable
media may also be used, such as ZIP drives, or the like.

A display device 1053, such as a monitor, is connected to the I/O bus
1026 via another interface, such as a video adapter 1042. A parallel interface
1043 connects synchronous peripheral devices, such as a laser printer 1056, to the
I/O bus 1026. A serial interface 1044 connects communication devices to the I/O
bus 1026. A user may enter commands and information into the computer 1000
via the serial interface 1044 or by using an input device, such as a keyboard 1038,
a mouse 1036 or a modem 1057. Other peripheral devices (not shown) may also
be connected to the computer 1000, such as audio input/output devices or image
capture devices.

A number of program modules may be stored on the drives and in the
system memory 1020. The system memory 1020 can include both Random
Access Memory ("RAM") and Read Only Memory ("ROM"). The program
modules control how the computer 1000 functions and interacts with the user,
with I/O devices or with other computers. Program modules include routines,
operating systems 1065, application programs, data structures, and other software
or firmware components. In an illustrative embodiment, the learning machine
may comprise one or more pre-processing program modules 1075A, one or more
post-processing program modules 1075B, and/or one or more optimal
categorization program modules 1077 and one or more SVM program modules
1070 stored on the drives or in the system memory 1020 of the computer 1000.
Specifically, pre-processing program modules 1075A, post-processing program
modules 1075B, together with the SVM program modules 1070 may comprise
computer-executable instructions for pre-processing data and post-processing
output from a learning machine and implementing the learning algorithm
according to the exemplary methods described with reference to FIGS. 1 and 2.
Furthermore, optimal categorization program modules 1077 may comprise
computer-executable instructions for optimally categorizing a data set according
to the exemplary methods described with reference to FIG. 3.

The computer 1000 may operate in a networked environment using
logical connections to one or more remote computers, such as remote computer
1060. The remote computer 1060 may be a server, a router, a peer device or
other common network node, and typically includes many or all of the elements
described in connection with the computer 1000. In a networked environment,
program modules and data may be stored on the remote computer 1060. The
logical connections depicted in FIG. 4 include a local area network ("LAN") 1054
and a wide area network ("WAN") 1055. In a LAN environment, a network
interface 1045, such as an Ethernet adapter card, can be used to connect the
computer 1000 to the remote computer 1060. In a WAN environment, the
computer 1000 may use a telecommunications device, such as a modem 1057, to
establish a connection. It will be appreciated that the network connections shown
are illustrative and other devices for establishing a communications link between
the computers may be used.

In another embodiment, a plurality of SVMs can be configured to
hierarchically process multiple data sets in parallel or sequentially. In particular,
one or more first-level SVMs may be trained and tested to process a first type of
data and one or more first-level SVMs can be trained and tested to process a
second type of data. Additional types of data may be processed by other
first-level SVMs. The output from some or all of the first-level SVMs may be
combined in a logical manner to produce an input data set for one or more
second-level SVMs. In a similar fashion, output from a plurality of second-level
SVMs may be combined in a logical manner to produce input data for one or
more third-level SVMs. The hierarchy of SVMs may be expanded to any number
of levels as may be appropriate. In this manner, lower hierarchical level SVMs
may be used to pre-process data that is to be input into higher level SVMs. Also,
higher hierarchical level SVMs may be used to post-process data that is output
from lower hierarchical level SVMs.

Each SVM in the hierarchy or each hierarchical level of SVMs may be
configured with a distinct kernel. For example, SVMs used to process a first type
of data may be configured with a first type of kernel while SVMs used to process
a second type of data may utilize a second, different type of kernel. In addition,
multiple SVMs in the same or different hierarchical level may be configured to
process the same type of data using distinct kernels.

FIG. 5 illustrates an exemplary hierarchical system of SVMs. As shown,
one or more first-level SVMs 1302a and 1302b may be trained and tested to
process a first type of input data 1304a, such as mammography data, pertaining to
a sample of medical patients. One or more of these SVMs may comprise a
distinct kernel, indicated as "KERNEL 1" and "KERNEL 2". Also, one or more
additional first-level SVMs 1302c and 1302d may be trained and tested to process
a second type of data 1304b, which may be, for example, genomic data for the
same or a different sample of medical patients. Again, one or more of the
additional SVMs may comprise a distinct kernel, indicated as "KERNEL 1" and
"KERNEL 3". The output from each of the like first-level SVMs may be
compared with each other, e.g., 1306a compared with 1306b; 1306c compared
with 1306d, in order to determine optimal outputs 1308a and 1308b. Then, the
optimal outputs from the two groups of first-level SVMs, i.e., outputs 1308a and
1308b, may be combined to form a new multi-dimensional input data set 1310,
for example, relating to mammography and genomic data. The new data set may
then be processed by one or more appropriately trained and tested second-level
SVMs 1312a and 1312b. The resulting outputs 1314a and 1314b from
second-level SVMs 1312a and 1312b may be compared to determine an optimal
output 1316. Optimal output 1316 may identify causal relationships between the
mammography and genomic data points. As should be apparent to those of skill
in the art, other combinations of hierarchical SVMs may be used to process,
either in parallel or serially, data of different types in any field or industry in
which analysis of data is desired.

FIG. 6 illustrates the general architecture of a module 400 of the data
mining platform of the present invention. The central component of module 400
is the data analysis engine 404, which, in the preferred embodiment, is one or
more SVMs. The analysis engine 404 processes raw input data from input
database 402 to produce intelligent output data at output database 406. A web
server 410 offers an interface to view the data and monitor the engine. Raw input
data may be unstructured or weakly structured, and may be in the form of, for
example, sequences of characters or a table of coefficients. Output data are more
structured and typically are in the form of graphs, e.g., decision trees, networks,
nested subsets, or ranked lists.

The basic building block of the module comprises a web interface to a
software package (written in a scientific language such as MatLab®, a scripting
language such as Perl, or a programming language such as C or C++) and
databases holding input data and the results of the computation, i.e., output data.
A wizard program guides the researcher through a deployment procedure. A key
aspect of this system is its modularity, which allows the assembly of modules for
processing heterogeneous data sets in a hierarchical structure, or modules may be
cascaded. As illustrated in FIG. 7, two modules 500 and 550 each possess the
basic module components of the input database 502, 552, the data analysis engine
504, 554 and the output database 506, 556. (Note that the web server 510 is not
shown within the dashed lines used to indicate each module because it is shared
by the two modules.) An additional data analysis engine 520 is provided to
receive, combine and process the outputs of the two modules 500 and 550. The
results of this second analysis operation are then provided to output database 522.
As indicated in the drawing, each component of each module 500, 550, the
second analysis engine 520 and the second output database 522 are in
communication with the web server 510 to provide for viewing and monitoring.

The essential difference between any two modules within the platform
resides in the type of information each one processes. For example, one module
may process numerical or structured data, while another might process textual or
unstructured data. Both modules rely on the same architecture and require
equally powerful machine learning and statistical techniques for data analysis and
data mining.

In a first exemplary embodiment, the basic module is an enterprise server
application accessible from a regular Internet Web browser. Its purpose is to
offer on-line access to numerical simulations through a user-friendly interface.
There are three main functions. First is the computer engine monitor, in which a
user can set parameters, input raw data, and order a new batch of numerical
experiments. Second is the data explorer, through which the user can recall the
results of numerical experiments and view them in text and/or graphical format.
Third is general services and infrastructure, which provides for pre- and
post-login screens. The module server is a restricted area accessible only by the
server operator and authorized customers and guests. Data transmission between
the browser and server is encrypted (HTTPS protocol) and user authentication
through a login password combination is required to enter the site. The server is
preferably protected behind a firewall to protect the operator's network. Access
may be divided into categories, with different levels of access according to the
user's category.

In the preferred embodiment, the module server is implemented in Java™
and adopts Sun Microsystems' specification of the Java 2 Platform, Enterprise
Edition ("J2EE"). This platform is an architecture for developing, deploying and
executing applications in a distributed environment. These applications require
system-level services, such as transaction management, security, client
connectivity, and database access. The primary benefits of Java and J2EE include
power of expression of object-oriented coding; ease of implementation,
deployment and maintenance; clean, modular and scalable architecture;
distributed application across processes and machines; code portability across
different operating systems and different J2EE platform vendors.

An important exception to implementation of the module using Java is
that the data analysis engine at the core of the module will preferably be
implemented using C, or will be executed inside a third-party environment such
as MatLab®. For increased efficiency, the engine processes can be hosted in a
dedicated, extra-powered machine.

The J2EE platform is thus reserved for the high-level control of the
workflow, whereas low-level computation intensive tasks are performed outside
of the Java environment.

FIG. 8 illustrates the global architecture of the basic module, with the
input and output data being grouped into the "Data Center" 802. The main
application is split into a server part (J2EE) 804 and a computational part 806.

The components of the Data Center 802 will vary depending on the type
of data to be processed and generated. For example, in the case of gene
expression data to be analyzed for diagnosing a disease, the input data 820 is raw
data, and the output data 830 is numerical results. In the case of a search of
published literature for information on gene-disease-organ relationships, the
input data 820 is obtained from a search of the Internet and the output data
830 is structured information.

In the J2EE application model, the "Java beans" 812 are the business
objects that reside in the application server 810. "Java beans" typically contain
the logic that operates on the data. Examples of beans for application to
bioinformatics according to the present invention are gene datasets, gene tree
results, patient records, documents, document search results, etc. The Web pages
816 are the business objects residing in the Web server 814. Connectors provide
a low level infrastructure of communication between servers, clients and
database systems using Java protocols such as Remote Method Invocation (RMI)
and Java Database Connectivity (JDBC). The mathematical algorithms, which
reside in Computer Center 806, are not implemented in Java. Instead the analysis
engine is implemented in Perl, C or MatLab 840. Interface with the user is
provided via a Web browser 860. The J2EE server platform is well known in the
art and, therefore, further explanation is not provided.

In an exemplary application to bioinformatics, the system according to the
present invention provides for the following operations: 1) Data acquisition:
For each event to be recorded (called a "pattern", which may correspond to a
patient, a tissue, etc.) transform the inputs obtained by a sensor (such as a DNA
microarray, a spectrometer) or transform textual information into a fixed
dimension vector. The elements of the vector (called "features", input variables,
components, or attributes) are computed from the data coming from the sensors
or the text in a fixed determined way. The result of data acquisition is a matrix of
patterns (lines) of features (columns), or vice versa. As previously stated, one
module is provided for each type of data. 2) Preprocessing: Use a combination
of normalization (e.g., subtracting mean and dividing by standard deviation) and
non-linear filtering. Both the lines and the columns of the matrix may be
normalized and filtered. These and other pre-processing operations are described
below. 3) Baseline performance and data cleaning: Both manual and automatic
cleaning of data are possible. The principle is to detect outliers that are difficult
to predict with a learning machine. Details are provided below in the
discussion of data cleaning algorithms.

4) Evaluate the problem dimensionality and difficulty: Use Principal
Component Analysis (PCA), which is known in the art (see, e.g., Chapter 14 on
"Kernel Feature Extraction" in Learning with Kernels, B. Schölkopf and A. J.
Smola, 2002, MIT Press, which is incorporated herein by reference) to find the
optimum number of principal components. Possible methods include, but are not
limited to, using the gap statistic of Tishby et al. or building a predictor.
Attempt to build predictors with single genes. For classification, determine
whether the problem is linearly separable with all genes and with single genes.
Determine whether non-linear learning machines yield better leave-one-out
performance on all genes and on the first few principal components. 5) Restrict
the subset of methods to be further investigated: Given the analysis of problem
dimensionality and difficulty, select a subset of learning machines with which to
proceed (linear/non-linear, univariate with single gene ranking or multivariate
with gene subset selection, e.g. RFE).

6) Run methods and generate plots of outer loop cross validation:
Plot training and test performance as a function of number of genes to find the
optimum number of genes; and plot training and test performance as a function of
number of training examples for a given number of genes. Determine whether
the data set size is sufficient to achieve asymptotic behavior. 7) Diagnose
possible overfitting: If the multiple univariate methods give better results than
the multivariate methods, it is possible that the multivariate feature selection
method is overfitting. If overfitting of multivariate methods is diagnosed, try to
improve performance by regularizing using clustering as a preprocessing
operation and running gene selection on cluster centers. Other means of
regularizing include restricting the RFE search to genes pre-selected by
correlation methods or penalizing the removal of genes correlated with the ideal
gene during RFE. (See the discussion below on RFE and pre-processing
methods.) If no overfitting of multivariate methods is diagnosed, run supervised
clustering of the remaining genes using as cluster centers the elements of the best
gene subset and generate a tree of genes to provide alternate gene subsets. For
purposes of efficiency, the tree may be built with the top best genes only.

8) Validate results using information retrieval: Calculate the relevance of
genes to the disease at hand according to documents retrieved on-line. Compare
the gene sets selected by various methods using such relevance criterion. Provide
a tentative gene functional classification. 9) Combine the results obtained from
data coming from various sources by combining the various feature structures
obtained (tree, ranked lists, etc.) into a single structure. 10) Visualize the results:
Display the results and let the user browse through the results and display for
given features the original information.

The following sections describe the various listed functions that are
performed within the data mining platform according to the present invention.

Pre-processing Functions. Pre-processing can have a strong impact on SVM.
In particular, feature scales must be comparable. A number of possible
pre-processing methods may be used individually or in combination. One possible
pre-processing method is to subtract the mean of a feature from each value of
that feature, then divide the result by its standard deviation. Such
pre-processing is not necessary if scaling is taken into account in the
computational cost function. Another pre-processing operation can be performed
to reduce skew in the data distribution and provide more uniform distribution.
This pre-processing step involves taking the log of the value, which is
particularly advantageous when the data consists of gene expression
coefficients, which are often obtained by computing the ratio of two values.
For example, in a competitive hybridization scheme, DNA from two samples that
are labeled differently are hybridized onto the array. One obtains at every
point of the array two coefficients corresponding to the fluorescence of the
two labels and reflecting the fraction of DNA of either sample that hybridized to
the particular gene. Typically, the initial preprocessing step that is taken is to
take the ratio a/b of these two values. Although this initial preprocessing step is
adequate, it may not be optimal when the two values are small. Other initial
preprocessing steps include (a-b)/(a+b) and (log a - log b)/(log a + log b).
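
By way of illustration only, the ratio-based preprocessing steps named above
may be written as follows, where `a` and `b` denote the two fluorescence
coefficients measured at one point of the array. The function names are
hypothetical, and the sketch assumes strictly positive coefficients (and, for
the last function, coefficients whose logarithms do not sum to zero).

```python
import math

def ratio(a, b):
    """Plain ratio a/b of the two coefficients."""
    return a / b

def log_ratio(a, b):
    """Logarithm of the ratio; swapping a and b only changes the sign."""
    return math.log(a) - math.log(b)

def normalized_difference(a, b):
    """(a - b) / (a + b), which lies between -1 and 1 for positive a and b."""
    return (a - b) / (a + b)

def normalized_log_difference(a, b):
    """(log a - log b) / (log a + log b)."""
    return (math.log(a) - math.log(b)) / (math.log(a) + math.log(b))

print(ratio(200.0, 100.0), normalized_difference(200.0, 100.0))
```

Unlike the plain ratio, the logarithm of the ratio treats over-expression and
under-expression symmetrically, which is one way of reducing the skew
mentioned above.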

Another pre-processing step involves normalizing the data across all
samples by subtracting the mean. This preprocessing step is supported by the fact
that, using tissue samples, there are variations in experimental conditions from
microarray to microarray. Although standard deviation seems to remain fairly
constant, another possible preprocessing step is to divide the gene expression
values by the standard deviation to obtain centered data of standardized variance.

To normalize each gene expression across multiple tissue samples, the
mean expression value and standard deviation for each gene can be computed.
For all the tissue sample values of that gene (training and test), that mean is then
subtracted and the resultant value divided by the standard deviation. In some
experiments, an additional preprocessing step can be added by passing the data
through a squashing function [f(x) = c arctan(x/c)] to diminish the importance of
the outliers.
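
By way of illustration only, the per-gene normalization and the squashing
function may be sketched as follows. The use of the population standard
deviation, the handling of a constant gene, the default value of `c` and the
function names are assumptions made for the example.

```python
import math
from statistics import mean, pstdev

def standardize_gene(values):
    """Subtract the gene's mean and divide by its standard deviation."""
    m = mean(values)
    s = pstdev(values)  # population standard deviation (an assumption)
    if s == 0:
        return [0.0 for _ in values]  # constant gene: nothing to scale
    return [(v - m) / s for v in values]

def squash(x, c=1.0):
    """Squashing function f(x) = c * arctan(x / c)."""
    return c * math.atan(x / c)

z = standardize_gene([2, 4, 6])
print(z, [squash(v) for v in z])
```

For values that are small relative to `c` the squashing function is close to
the identity, while its output never exceeds c times pi/2 in magnitude, so
that extreme values have a bounded influence.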

In a variation on several of the preceding pre-processing methods, the data
can be pre-processed by a simple "whitening" to make the data matrix resemble
"white noise." The samples can be pre-processed to: normalize matrix columns;
normalize matrix lines; and normalize columns again. Normalization consists of
subtracting the mean and dividing by the standard deviation. A further
normalization step can be taken when the samples are split into a training set and
a test set.
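
By way of illustration only, the three-pass normalization described above may
be sketched as follows for a matrix stored as a list of rows. The function
names, the use of the population standard deviation and the treatment of a
constant line or column are assumptions made for the example.

```python
from statistics import mean, pstdev

def normalize(vector):
    """Subtract the mean and divide by the standard deviation."""
    m, s = mean(vector), pstdev(vector)
    return [(v - m) / s if s else 0.0 for v in vector]

def normalize_columns(matrix):
    columns = [normalize(list(column)) for column in zip(*matrix)]
    return [list(row) for row in zip(*columns)]

def normalize_rows(matrix):
    return [normalize(row) for row in matrix]

def whiten(matrix):
    """Normalize columns, then lines (rows), then columns again."""
    return normalize_columns(normalize_rows(normalize_columns(matrix)))

data = [[1.0, 2.0, 4.0],
        [3.0, 1.0, 0.0],
        [5.0, 9.0, 2.0],
        [2.0, 4.0, 8.0]]
for row in whiten(data):
    print(row)
```

After the final pass each column has zero mean and unit variance; the lines
are only approximately normalized, since the last column pass perturbs them.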

Clustering Methods: Because of data redundancy, it may be possible to
find many subsets of data that provide a reasonable separation. To analyze the
results, the relatedness of the data should be understood.

In correlation methods, the rank order characterizes how correlated the
data is with the separation. Generally, a highly ranked data point taken alone
provides a better separation than a lower ranked data point. It is therefore
possible to set a threshold, e.g., keep only the top ranked data points, that
separates "highly informative data points" from "less informative data points".
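
By way of illustration only, a correlation ranking with a threshold may be
sketched as follows. The description above does not name a particular
correlation measure; the sketch assumes the absolute Pearson correlation
between each feature (column) and the class labels, and the function names
and the parameter `k` are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def rank_features(matrix, labels):
    """Column indices ordered from most to least correlated with the labels."""
    columns = list(zip(*matrix))
    scores = [abs(pearson(column, labels)) for column in columns]
    return sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)

def top_features(matrix, labels, k):
    """Threshold on rank: keep only the k top ranked features."""
    return rank_features(matrix, labels)[:k]

# Rows are patterns, columns are features.
matrix = [[5, 1, 3],
          [6, 2, 1],
          [1, 1, 2],
          [2, 2, 1]]
labels = [1, 1, -1, -1]
print(rank_features(matrix, labels), top_features(matrix, labels, 2))
```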

Feature selection methods such as SVM-RFE, described below, provide
subsets of data that are both smaller and more discriminant. The data selection
method using SVM-RFE also provides a ranked list of data. With this list, nested
subsets of data of increasing sizes can be defined. However, the fact that one data
point has a higher rank than another data point does not mean that this one factor
alone characterizes the better separation. In fact, data that are eliminated in an
early iteration could well be very informative but redundant with others that were
kept. Data ranking allows for building nested subsets of data that provide good
separations; however, it provides no information as to how good an individual
data point may be.

Data ranking alone is insufficient to characterize which data points are
informative and which ones are not, and also to determine which data points are
complementary and which are redundant. Therefore, additional pre-processing in
the form of clustering may be appropriate.
5 Feature ranking is often combined with clustering. One can obtain a 

ranked list of subsets of equivalent features by ranking the clustas. hi one such 
method, a cluster can be replaced by its cluster cent^ and scores can be coniputed 
for the cluster center. In another method, the features can be scored individually 
and the score of a cluster computed as tiie average score of the features in that 
10 cluster. 
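The second method above (scoring a cluster by the average score of its
features) might be sketched as follows. The cluster names, feature names and
scores are made up for illustration, and `rank_clusters` is a hypothetical
helper rather than part of the described system.

```python
def rank_clusters(clusters, feature_scores):
    """Score each cluster by the average score of its features; best first."""
    cluster_scores = {
        name: sum(feature_scores[f] for f in members) / len(members)
        for name, members in clusters.items()
    }
    ranked = sorted(cluster_scores, key=cluster_scores.get, reverse=True)
    return ranked, cluster_scores


# Hypothetical individual feature scores and cluster memberships.
feature_scores = {"g1": 0.9, "g2": 0.5, "g3": 0.6}
clusters = {"A": ["g1", "g2"], "B": ["g3"]}
ranked, cluster_scores = rank_clusters(clusters, feature_scores)
```

The result is a ranked list of subsets of equivalent features, with each
cluster standing in for all of its members.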

To overcome the problems of data ranking alone, the data can be
preprocessed with an unsupervised clustering method. Using the QTclust ("quality
clustering algorithm") algorithm, which is known in the art, particularly to those
in the field of analysis of gene expression profiles, or some other clustering
algorithm such as hierarchical clustering or SVM clustering, data can be grouped
according to resemblance (according to a given metric). Cluster centers can then
be used instead of data points themselves and processed by SVM-RFE to produce
nested subsets of cluster centers. An optimum subset size can be chosen with the
same cross-validation method used before.

Supervised clustering may be used to show specific clusters that have
relevance for the specific knowledge being determined. For example, in analysis
of gene expression data for diagnosis of colon cancer, a very large cluster of
genes has been found that contained muscle genes that may be related to tissue
composition and may not be relevant to the cancer vs. normal separation. Thus,
these genes are good candidates for elimination from consideration as having
little bearing on the diagnosis or prognosis for colon cancer.

Feature Selection: The problem of selection of a small amount of data from a
large data source, such as a gene subset from a microarray, is particularly solved
using the methods, devices and systems described herein. Previous attempts to
address this problem used correlation techniques, i.e., assigning a coefficient to
the strength of association between variables. In examining genetic data to find
determinative genes, the present methods eliminate gene redundancy automatically
and yield better and more compact gene subsets. The methods, devices and systems
described herein can be used with publicly-available data to find relevant
answers, such as genes determinative of a cancer diagnosis, or with specifically
generated data.

The score of a feature is a quantity that measures the relevance or
usefulness of that feature (or feature subset), with a larger score indicating that
the feature is more useful or relevant. The problem of feature selection can only
be well defined in light of the purpose of selecting a subset of features. Examples
of feature selection problems that differ in their purpose include designing a
diagnostic test that is economically viable. In this case, one may wish to find the
smallest number of features that provides the smallest prediction error, or
provides a prediction error less than a specified threshold. Another example is
that of finding good candidate drug targets. The two examples differ in a number
of ways.

In diagnosis and prognosis problems, the predictor cannot be dissociated
from the problem because the ultimate goal is to provide a good predictor. One
can refer to the usefulness of a subset of features to build a good predictor. The
expected value of prediction error (the prediction error computed over an infinite
number of test samples) would be a natural choice to derive a score. One
problem is to obtain an estimate of the expected value of the prediction error of
good precision by using only the available data. Another problem is that it is
usually computationally impractical to build and test all the predictors
corresponding to all possible subsets of features. As a result of these constraints,
one typically resorts to use of sub-optimal scores in the search of good feature
subsets.

In drug target selection, the predictor is only used to substitute for the
biological organism under study. For example, choosing subsets of genes and
building new predictors are ways of substituting computer experiments for
laboratory experiments that knock out genes and observe the consequence on the
phenotype. The goal of target selection is to determine which feature(s) have the
greatest impact on the health of the patient. The predictor itself is not going to be
used. One refers to the "relevance" of the feature(s) with respect to the condition
or phenotype under study. It may be a good idea to score features using multiple
predictors and using a combined score to select features. Also, in diagnosis and
prognosis, correlated features may be substituted for one another. The fact that
feature correlations may mean causal relationships is not significant. On the
other hand, in target selection, it is much more desirable to select the feature that
is at the source of a cascade of events as opposed to a feature that is further down
on the chain. For these reasons, designing a good score for target selection can be
a complex problem.

In order to compare scores obtained from a number of different sources,
and to allow simple score arithmetic, it is useful to normalize the scores. (See
above discussion of pre-processing.) The ranking obtained with a given score is
not affected by applying any monotonically increasing function. This includes
exponentiation, multiplication or division by a positive constant, and addition or
subtraction of a constant. Thus, a wide variety of normalization schemes may be
applied.

As an example, the following considers conversion of scores into a
quantity that can be interpreted as a probability or a degree of belief that a given
feature or feature subset is "good". Assume that a given method generated scores
for a family of subsets of features. Such family may include: all single features,
all feature pairs, all possible subsets of features. Converting a score to a
probability-like quantity may include exponentiation (to make the score positive),
and normalization by dividing by the sum of all the scores in the family.
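One possible reading of the conversion just described is sketched below:
exponentiate every score in the family, then divide by the sum. The function
name is hypothetical, and subtracting the maximum before exponentiating is an
added numerical-stability step (an assumption, not taken from the text); being
a constant shift, it changes neither the ranking nor the normalized result.

```python
import math


def scores_to_probabilities(scores):
    """Exponentiate each score, then divide by the sum over the family."""
    shift = max(scores.values())  # constant shift; cancels after normalization
    exponentiated = {k: math.exp(v - shift) for k, v in scores.items()}
    total = sum(exponentiated.values())
    return {k: v / total for k, v in exponentiated.items()}


# Hypothetical raw scores for a family of single features.
raw = {"f1": 2.0, "f2": 0.0, "f3": -1.0}
probs = scores_to_probabilities(raw)
```

The resulting values are positive, sum to one, and preserve the original
ranking, since exponentiation is monotonically increasing.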

In the following, $P(f_1, f_2, \ldots, f_n)$ denotes the score normalized as
probability for the feature subset $(f_1, f_2, \ldots, f_n)$. Then,
$P(f_1 \mid f_2, \ldots, f_n)$ is the score normalized as probability of feature
$f_1$ given that features $(f_2, \ldots, f_n)$ have already been selected.

Scores converted to probabilities can be combined according to the chain
rule ($P(f_1, f_2, \ldots, f_n) = P(f_1 \mid f_2, \ldots, f_n)\, P(f_2, \ldots, f_n)$)
or Bayes rule
($P(f_1, f_2, \ldots, f_n) = \sum_i P(f_1, f_2, \ldots, f_n \mid C_i)\, P(C_i)$),
where $C_i$ could be various means of scoring using different experimental data or
evidence and $P(C_i)$ would be weights measuring the reliability of such data
source ($\sum_i P(C_i) = 1$).

Scoring a large number of feature subsets is often computationally
impractical. One can attempt to estimate the score of a larger subset of features
from the scores of smaller subsets of features by making independence
assumptions, i.e., $P(f_1, f_2, \ldots, f_n) = P(f_1)\, P(f_2) \cdots P(f_n)$. Or, if
there are scores for pairs of features, scores for triplets can be derived by
replacing
$P(f_1, f_2, f_3) = P(f_1, f_2 \mid f_3)\, P(f_3) = P(f_1, f_3 \mid f_2)\, P(f_2) = P(f_2, f_3 \mid f_1)\, P(f_1)$
with
$P(f_1, f_2, f_3) = \tfrac{1}{3}\left(P(f_1, f_2 \mid f_3)\, P(f_3) + P(f_1, f_3 \mid f_2)\, P(f_2) + P(f_2, f_3 \mid f_1)\, P(f_1)\right)$.
Other scores for large numbers of features can be derived from the scores of small
numbers of features in a similar manner.
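The two estimates above can be illustrated with a short sketch. The helper
names are hypothetical. For the triplet estimate, the conditional pair term
$P(f_a, f_b \mid f_c)$ is approximated here by the unconditional pair score
$P(f_a, f_b)$; that substitution is an assumption made for this example and is
not stated in the text.

```python
from math import prod


def subset_score_independent(single, subset):
    """P(f1, ..., fn) estimated as the product of the single-feature scores."""
    return prod(single[f] for f in subset)


def triplet_score_from_pairs(single, pair, a, b, c):
    """Average of the three decompositions, with P(x, y | z) ~ P(x, y)."""
    terms = (
        pair[frozenset((a, b))] * single[c],
        pair[frozenset((a, c))] * single[b],
        pair[frozenset((b, c))] * single[a],
    )
    return sum(terms) / 3.0


# Hypothetical probability-like scores for single features and pairs.
single = {"f1": 0.5, "f2": 0.25, "f3": 0.2}
pair = {
    frozenset(("f1", "f2")): 0.125,
    frozenset(("f1", "f3")): 0.1,
    frozenset(("f2", "f3")): 0.05,
}
full = subset_score_independent(single, ["f1", "f2", "f3"])
triplet = triplet_score_from_pairs(single, pair, "f1", "f2", "f3")
```

In this toy case the pair scores were chosen as products of the single scores,
so both estimates agree; with real pair scores they generally differ.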

One of the simplest structures for representing alternative choices of
features is a ranked list. The features are sorted according to their scores such
that the most promising features according to that score are top ranked and the
least promising features are ranked lowest. The opposite order is also possible.
Scores include: prediction success rate of a classifier built using a single feature;
absolute value of the weights of a linear classifier; value of a correlation
coefficient between the feature vector and the target feature vector consisting of
(+1) and (-1) values corresponding to class labels A or B (in a two class
problem), where correlation coefficients include the Pearson correlation
coefficient; and value of the Fisher criterion in a multi-class problem.
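As an illustration of one of the scores listed above, the following sketch
ranks features by the Pearson correlation coefficient between each feature
vector and a target vector of (+1) and (-1) class labels. Using the absolute
value of the coefficient as the score and assigning a score of zero to a
constant feature are choices made for this example; the data are toy values.

```python
from math import sqrt


def pearson(u, v):
    """Pearson correlation coefficient; 0.0 if either vector is constant."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sqrt(sum((a - mu) ** 2 for a in u))
    sv = sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv) if su and sv else 0.0


def rank_features(patterns, labels):
    """Return feature indices sorted from most to least promising."""
    columns = list(zip(*patterns))
    scores = [abs(pearson(col, labels)) for col in columns]
    order = sorted(range(len(columns)), key=lambda i: scores[i], reverse=True)
    return order, scores


# Rows are patterns, columns are features; labels are +1 / -1.
patterns = [[2.0, 5.0, 1.0], [3.0, 5.0, 0.0], [-2.0, 5.0, 1.0], [-3.0, 5.0, 0.5]]
labels = [1, 1, -1, -1]
order, scores = rank_features(patterns, labels)
```

Here feature 0 follows the labels closely, feature 2 only weakly, and feature 1
is constant, so the ranked list is `[0, 2, 1]`.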

It is often desirable to select a subset of features that complement each
other to provide best prediction accuracy. Using a ranked list of features, one can
rank subsets of features. For example, using scores normalized as probabilities
and making feature independence assumptions, the above-described chain rule
can be applied.

Independence assumptions are often incorrect. Methods of forward
feature selection or backward elimination (including RFE, discussed below)
allow the construction of nested subsets of complementary features
$F_1 \subset F_2 \subset \ldots \subset F_m$ using a greedy search algorithm that
progressively adds or removes features. For scores normalized as probabilities,
the chain rule applies. For example, assume $F_1 = \{f_a\}$ and
$F_2 = \{f_a, f_b\}$. The relationship
$P(F_2) = P(F_1)\, P(F_2 \mid F_1) = P(f_a)\, P(f_a, f_b \mid f_a)$ is a forward
selection scheme where $f_b$ can be added once $f_a$ has been selected with the
probability $P(f_a, f_b \mid f_a)$ of making a good choice. Similarly, if it is
assumed that $F_{m-1} = \{f_a, f_b, \ldots, f_j\}$ and
$F_m = \{f_a, f_b, \ldots, f_j, f_k\}$, then
$P(F_{m-1}) = P(F_m)\, P(F_{m-1} \mid F_m) = P(f_a, f_b, \ldots, f_j, f_k)\, P(f_a, f_b, \ldots, f_j \mid f_a, f_b, \ldots, f_j, f_k)$.
This can be read in a backward elimination scheme as: eliminate $f_k$ when the
remaining subset is $\{f_a, f_b, \ldots, f_j, f_k\}$, with probability
$P(f_a, f_b, \ldots, f_j \mid f_a, f_b, \ldots, f_j, f_k)$ of making a good choice.

Alternatively, one can add or remove more than one feature at a time.
(See detailed description of RFE below.) As an example, RFE-SVM is a
backward elimination procedure that uses as the score to rank the next feature to
be eliminated a quantity that approximates the difference in success rate
$S(F_{m-1}) - S(F_m)$. Scores are additive and probabilities multiplicative, so by
using exponentiation and normalization, the score difference can be mapped to
$P(F_{m-1} \mid F_m)$, with $P(F_m \mid F_{m-1}) = 1$ because of the backward
elimination procedure. Since $P(F_m)$ is proportional to $\exp(S_m)$,
$P(F_{m-1} \mid F_m) = P(F_m \mid F_{m-1})\, P(F_{m-1}) / P(F_m) = \exp(S_{m-1} - S_m)$.

In a manner similar to that described for ranked lists of subsets of equivalent
features, nested subsets can be constructed of complementary subsets of
equivalent features. Clustering can be used to create "super features" (cluster
centers). The nested subsets of super features define nested subsets of subsets of
equivalent features, i.e., the corresponding clusters.

In an alternative method, nested subsets of complementary features can be
constructed using cardinality increment of one. The first few subsets are kept,
then the remaining features are aggregated to the features in the nested subset. In
other words, the features in the nested subset are used as cluster centers, then
clusters are formed around those centers with the remaining features. One
application of such structures is the selection of alternate subsets of
complementary features by replacing the cluster centers in a subset of cluster
centers with one of the cluster members.

Nested subsets of complementary subsets of equivalent features may
produce alternate complementary subsets of features that are sub-optimal. Trees
can provide a better alternative for representing a large number of alternate nested
subsets of complementary features. Each node of the tree is labeled with a
feature and a feature subset score. The children of the root node represent
alternate choices for the first feature. The children of the children of the root are
alternate choices for the second feature, etc. The path from the root node to a
given node is a feature subset, the score of which is attached to that node.

The score for siblings is the score of the subset including the child feature
and all its ancestors. For scores normalized to probabilities, sorting of the
siblings is done according to the joint probability
$P(\text{ancestors}, \text{child})$. Given that siblings share the same ancestors,
such sibling ranking also corresponds to a ranking according to
$P(\text{child} \mid \text{ancestors})$. This provides a ranking of alternate
subsets of features of the same size.

Trees can be built with forward selection algorithms, backward selection
algorithms, exhaustive feature subset evaluation or other search strategies. Trees
are structures that generalize both ranked lists and nested subsets of features. A
tree of depth one is a ranked list (of all children of the root). A tree that has only
one branch defines nested subsets of features. One can also build trees of
super-features (cluster centers) and, therefore, obtain a structure that contains
multiple alternatives of nested subsets of subsets of equivalent features. Another
variant is to build a tree using only the top features of a ranked list. Subsequently,
the features eliminated can be aggregated to the nodes of the tree they most resemble.
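The tree structure discussed above can be sketched with a small class. This is
an illustrative data structure only; the class name, method names, feature
names and scores are all hypothetical. Each node carries a feature and the
score of the subset formed by the path from the root, and siblings are kept
sorted by score.

```python
class FeatureNode:
    """Tree node labeled with a feature and the score of its root path."""

    def __init__(self, feature=None, score=None):
        self.feature = feature  # None for the root
        self.score = score      # score of the subset root -> this node
        self.children = []

    def add_child(self, feature, score):
        """Add an alternate next feature; keep siblings sorted best first."""
        child = FeatureNode(feature, score)
        self.children.append(child)
        self.children.sort(key=lambda node: node.score, reverse=True)
        return child


def subsets(node, prefix=()):
    """Yield (feature subset, score) for every node below the given node."""
    for child in node.children:
        path = prefix + (child.feature,)
        yield path, child.score
        yield from subsets(child, path)


root = FeatureNode()
fa = root.add_child("fa", 0.6)
fb = root.add_child("fb", 0.8)
fa.add_child("fc", 0.3)
all_subsets = list(subsets(root))
```

A tree in which only the root has children reduces to a ranked list, and a tree
with a single branch reduces to nested subsets, as noted in the text.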

Other graphs, particularly other kinds of directed acyclic graphs, may have
some relevance to describe subsets of features. For example, Bayesian networks
have been used to describe relationships between genes.

For some features, e.g., genes, one can obtain patterns from various
sources. Assume that one wishes to assess the relevance of certain genes with
respect to a given disease. Gene scores (or gene subset scores) can be derived
from DNA microarray gene expression coefficients for a variety of diseased and
normal patients. Other scores can be obtained from protein arrays, and still other
scores can be obtained by correlation of the citation of various genes with the
given disease from published medical articles. In each case, a feature subset data
structure can be constructed. These structures can then be combined to select
feature subsets based on the combined information.

When combining information from different data sources, ranked lists are
typically the easiest to combine. For a ranking of $n$ features (or feature subsets)
from two different sources, let $S_1, S_2, \ldots, S_n$ be the scores for the first
source and $S'_1, S'_2, \ldots, S'_n$ be the scores for the second source. If the
scores used to rank the two lists are commensurate, then a new ranked list can be
created using a combined score for every feature. The combination can be, for
example, additive ($S_1 + S'_1, S_2 + S'_2, \ldots, S_n + S'_n$) or multiplicative
($S_1 S'_1, S_2 S'_2, \ldots, S_n S'_n$). Different types of score combinations
yield different rankings. If the scores are not commensurate, they can be replaced
by scores having no dimension. In this case, the scores are replaced by the ranks
of the features in the two lists: $S_i$ becomes the rank of feature subset $i$ in the
list ranked according to the first source's scores, $i = 1, \ldots, n$, and $S'_i$
becomes the rank of feature subset $i$ in the list ranked according to the second
source's scores, $i = 1, \ldots, n$.
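The combinations described above might look as follows in code. This is a
sketch under stated assumptions: the function names are hypothetical, rank 1 is
assigned to the highest score, ties are broken by position, and the scores are
toy values on deliberately different scales.

```python
def ranks(scores):
    """Replace scores by dimensionless ranks (1 = highest score)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    result = [0] * len(scores)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def combine(first, second, mode="additive"):
    """Combine two score lists element by element."""
    if mode == "additive":
        return [a + b for a, b in zip(first, second)]
    if mode == "multiplicative":
        return [a * b for a, b in zip(first, second)]
    raise ValueError("unknown mode: " + mode)


first = [0.9, 0.5, 0.1]      # scores from the first source
second = [10.0, 30.0, 20.0]  # scores from the second source (other scale)

# Non-commensurate case: combine the ranks instead of the raw scores.
rank_sum = combine(ranks(first), ranks(second))
best_by_rank = min(range(len(rank_sum)), key=rank_sum.__getitem__)
```

When ranks are used as scores, a smaller combined value is better, which is why
the best feature is selected with `min` rather than `max`.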

All of the above-proposed schemes can be trivially generalized to 
combinations of more than two scores. 

Scores can be made commensurate by normalizing them to probability-like
scores. If the scores are probability-like quantities, additive combinations
might be interpreted as voting according to the confidence
$P(\text{source}_i)$ that one has in the various sources of information:
$P(\text{feature subset}_j) = \sum_i P(\text{feature subset}_j \mid \text{source}_i)\, P(\text{source}_i)$.

If not all of the feature subsets scored using the data from the first source
are scored using the data from the second source, they can still be combined by
using the intersection of the subsets scored. Alternatively, one can complete the
missing scores in each list explicitly by computing them, or approximate by using
independence assumptions.
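A minimal sketch of the confidence-weighted vote, restricted to the
intersection of the subsets scored by every source, is given below. The source
names, feature names, scores and confidences are invented for illustration,
and `combine_sources` is a hypothetical helper.

```python
def combine_sources(per_source, confidence):
    """Sum over sources of P(subset | source) * P(source), on common subsets."""
    if abs(sum(confidence.values()) - 1.0) > 1e-9:
        raise ValueError("source confidences must sum to 1")
    # Keep only the subsets that every source has scored (the intersection).
    common = set.intersection(*(set(scores) for scores in per_source.values()))
    return {
        subset: sum(per_source[src][subset] * confidence[src] for src in per_source)
        for subset in common
    }


# Hypothetical probability-like scores from two sources.
per_source = {
    "microarray": {"g1": 0.6, "g2": 0.4},
    "literature": {"g1": 0.2, "g2": 0.5, "g3": 0.3},
}
confidence = {"microarray": 0.75, "literature": 0.25}
combined = combine_sources(per_source, confidence)
```

Feature `g3` is dropped because only one source scored it; completing the
missing score instead would be the alternative mentioned in the text.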

Ranked lists and nested subsets of features can be composed rather than
combined. One can first rank the features according to a given score, selecting the
top ranked features, then create a nested subset of features on the remaining
subset. Such combination of operations makes sense in cases where a given
nested subset of features algorithm is prone to "overfit" the data, i.e., choose
combinations of features that have a small prediction error on training examples,
but a large prediction error on new test examples. Using a feature ranking
algorithm first may reduce the risk of overfitting by eliminating those features
that are poorly correlated to the target and coincidentally complement each other.

Trees and ranked lists are easily combined using methods for combining
ranked lists. One uses the fact that the siblings in a tree constitute a ranked list,
the score of each node being the score of all the features from the root to the
given node. That ranked list can be combined with another ranked list that gives
scores for the same feature subsets. A new tree is built with the combined scores
and the siblings are ranked again accordingly. Lists containing missing values
can be truncated or completed.

Combining trees can be done in a similar way. This amounts to
combining the scores of the feature subsets found in the tree, by completing
missing values if necessary, and re-ranking the siblings.

In the case where scores are probability-like, the scores used to rank the
siblings $P(\text{child} = \text{feature}_i \mid \text{ancestors}, \text{source}_1)$
in a tree built from data source 1 can be combined with the scores used to rank
the siblings
$P(\text{child} = \text{feature}_i \mid \text{ancestors}, \text{source}_2)$ in a
tree built from data source 2 as
$P(\text{child} = \text{feature}_i \mid \text{ancestors}) = P(\text{child} = \text{feature}_i \mid \text{ancestors}, \text{source}_1)\, P(\text{source}_1) + P(\text{child} = \text{feature}_i \mid \text{ancestors}, \text{source}_2)\, P(\text{source}_2)$.

There is a particular case in which the second tree is built from scores
obtained by making independence assumptions,
$P(f_1, f_2, \ldots, f_n) = P(f_1)\, P(f_2) \cdots P(f_n)$. In that case,
$P(\text{child} = \text{feature}_i \mid \text{ancestors}, \text{source}_2) = P(\text{feature}_i \mid \text{source}_2)$.

Other methods of combining structures of relevant features will become
apparent to those of skill in the art based on the preceding examples.

In one method of feature selection, a pre-processing operation may
involve the use of expert knowledge to eliminate data that are known to
complicate analysis due to the difficulty in differentiating the data from other data
that is known to be useful. In the colon cancer example used above, tissue
composition-related genes were automatically eliminated in the pre-processing
step by searching for the phrase "smooth muscle" in the description of the gene.
Other means for searching the data for indicators of the smooth muscle genes
may be used.
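The keyword-based elimination described above can be sketched as a simple
filter over gene descriptions. The gene identifiers and description strings
below are placeholders invented for this example, and the case-insensitive
match is an assumption.

```python
def drop_by_description(descriptions, phrase="smooth muscle"):
    """Remove every gene whose description contains the given phrase."""
    phrase = phrase.lower()
    return {
        gene: text
        for gene, text in descriptions.items()
        if phrase not in text.lower()
    }


# Placeholder gene identifiers and descriptions (not real annotations).
descriptions = {
    "gene_1": "Smooth muscle protein, example entry",
    "gene_2": "Example kinase entry",
    "gene_3": "Example entry, SMOOTH MUSCLE isoform",
}
kept = drop_by_description(descriptions)
```

Only `gene_2` survives the filter in this toy input.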

Feature Selection by Recursive Feature Elimination

While the illustrative examples are directed at gene expression data
manipulations, any data can be used in the methods, systems and devices
described herein. There are studies of gene clusters discovered by unsupervised
or supervised learning techniques. Preferred methods comprise application of
SVMs in determining a small subset of highly discriminant genes that can be used
to build very reliable cancer classifiers. Identification of discriminant genes is
beneficial in confirming recent discoveries in research or in suggesting avenues
for research or treatment. Diagnostic tests that measure the abundance of a given
protein in bodily fluids may be derived from the discovery of a small subset of
discriminant genes.

In classification methods using SVMs, the input is a vector referred to as a
"pattern" of $n$ components referred to as "features". $F$ is defined as the
$n$-dimensional feature space. In the examples given, the features are gene
expression coefficients and the patterns correspond to patients. While the present
discussion is directed to two-class classification problems, this is not to limit the
scope of the invention. The two classes are identified with the symbols (+) and
(-). A training set of a number of patterns
$\{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_k, \ldots, \mathbf{x}_\ell\}$ with
known class labels $\{y_1, y_2, \ldots, y_k, \ldots, y_\ell\}$, $y_k \in \{-1, +1\}$,
is given. The training patterns are used to build a decision function (or
discriminant function) $D(\mathbf{x})$, that is a scalar function of an input
pattern $\mathbf{x}$. New patterns are classified according to the sign of the
decision function:

$D(\mathbf{x}) > 0 \Rightarrow \mathbf{x} \in$ class (+);

$D(\mathbf{x}) < 0 \Rightarrow \mathbf{x} \in$ class (-);

$D(\mathbf{x}) = 0$, decision boundary;

where $\in$ means "is a member of".

Decision functions that are simple weighted sums of the training patterns plus a
bias are referred to as "linear discriminant functions", e.g.,

$$D(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b, \tag{1}$$

where $\mathbf{w}$ is the weight vector and $b$ is a bias value. A data set is said
to be linearly separable if a linear discriminant function can separate it without
error.
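The sign rule and the linear discriminant function of Equation 1 can be
illustrated with a few lines of code. The weight vector, bias and input
patterns are arbitrary toy values, and the function names are hypothetical.

```python
def decision(w, b, x):
    """Linear discriminant function D(x) = w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, b, x):
    """Classify a pattern according to the sign of D(x)."""
    d = decision(w, b, x)
    if d > 0:
        return "+"
    if d < 0:
        return "-"
    return "boundary"


w = [1.0, -1.0]  # toy weight vector
b = 0.0          # toy bias
labels = [classify(w, b, x) for x in ([2.0, 1.0], [1.0, 2.0], [1.0, 1.0])]
```

With these values the three patterns fall in class (+), in class (-) and on the
decision boundary, respectively.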

Feature selection in large dimensional input spaces is performed using
greedy algorithms and feature ranking. A fixed number of top ranked features
may be selected for further analysis or to design a classifier. Alternatively, a
threshold can be set on the ranking criterion. Only the features whose criterion
exceeds the threshold are retained. A preferred method uses the ranking to define
nested subsets of features, $F_1 \subset F_2 \subset \ldots \subset F$, and select
an optimum subset of features with a model selection criterion by varying a single
parameter: the number of features.

Errorless separation can be achieved with any number of genes greater
than one. Preferred methods comprise use of a smaller number of genes.
Classical gene selection methods select the genes that individually best classify
the training data. These methods include correlation methods and expression
ratio methods. While the classical methods eliminate genes that are useless for
discrimination (noise), they do not yield compact gene sets because genes are
redundant. Moreover, complementary genes that individually do not separate
well are missed.

A simple feature ranking can be produced by evaluating how well an
individual feature contributes to the separation (e.g., cancer vs. normal). Various
correlation coefficients have been proposed as ranking criteria. For example, see
T.R. Golub, et al., "Molecular classification of cancer: Class discovery and class
prediction by gene expression monitoring", Science 286, 531-37 (1999). The
coefficient used by Golub et al. is defined as:

$$w_i = \frac{\mu_i(+) - \mu_i(-)}{\sigma_i(+) + \sigma_i(-)} \tag{2}$$

where $\mu_i$ and $\sigma_i$ are the mean and standard deviation, respectively, of
the gene expression values of a particular gene $i$ for all the patients of class (+)
or class (-), $i = 1, \ldots, n$. Large positive $w_i$ values indicate strong
correlation with class (+) whereas large negative $w_i$ values indicate strong
correlation with class (-). The method described by Golub, et al. for feature
ranking is to select an equal number of genes with positive and with negative
correlation coefficient. Other methods use the absolute value of $w_i$ as ranking
criterion, or a related coefficient,

$$\frac{\left(\mu_i(+) - \mu_i(-)\right)^2}{\sigma_i(+)^2 + \sigma_i(-)^2}. \tag{3}$$
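The coefficient of Equation 2 can be computed per gene as sketched below. The
use of the population standard deviation and the convention of returning zero
when both class standard deviations vanish are assumptions made for this
example; the expression values are toy numbers.

```python
from statistics import mean, pstdev


def golub_coefficients(patterns, labels):
    """w_i = (mu_i(+) - mu_i(-)) / (sigma_i(+) + sigma_i(-)) for every gene i."""
    coefficients = []
    for column in zip(*patterns):
        pos = [v for v, y in zip(column, labels) if y == +1]
        neg = [v for v, y in zip(column, labels) if y == -1]
        spread = pstdev(pos) + pstdev(neg)  # population std (an assumption)
        difference = mean(pos) - mean(neg)
        coefficients.append(difference / spread if spread else 0.0)
    return coefficients


# Rows are patients, columns are genes; labels are +1 / -1.
patterns = [[4.0, 1.0], [6.0, 3.0], [1.0, 4.0], [3.0, 6.0]]
labels = [1, 1, -1, -1]
w = golub_coefficients(patterns, labels)
```

In this toy data the first gene is over-expressed in class (+) and the second
in class (-), so the two coefficients have opposite signs and equal magnitude.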

What characterizes feature ranking with correlation methods is the
implicit orthogonality assumptions that are made. Each coefficient $w_i$ is
computed with information about a single feature (gene) and does not take into
account mutual information between features.

One use of feature ranking is in the design of a class predictor (or
classifier) based on a pre-selected subset of genes. Each feature which is
correlated (or anti-correlated) with the separation of interest is by itself such a
class predictor, albeit an imperfect one. A simple method of classification
comprises a method based on weighted voting: the features vote in proportion to
their correlation coefficient. Such is the method used by Golub, et al. The
weighted voting scheme yields a particular linear discriminant classifier

$$D(\mathbf{x}) = \mathbf{w} \cdot (\mathbf{x} - \boldsymbol{\mu}), \tag{4}$$

where $w_i = (\mu_i(+) - \mu_i(-)) / (\sigma_i(+) + \sigma_i(-))$ and
$\boldsymbol{\mu} = (\boldsymbol{\mu}(+) + \boldsymbol{\mu}(-))/2$.

Another classifier or class predictor is Fisher's linear discriminant. Such
a classifier is similar to that of Golub et al. where

$$\mathbf{w} = S^{-1}\left(\boldsymbol{\mu}(+) - \boldsymbol{\mu}(-)\right), \tag{5}$$

where $S$ is the $(n, n)$ within class scatter matrix defined as

$$S = \sum_{\mathbf{x} \in X(+)} (\mathbf{x} - \boldsymbol{\mu}(+))(\mathbf{x} - \boldsymbol{\mu}(+))^T + \sum_{\mathbf{x} \in X(-)} (\mathbf{x} - \boldsymbol{\mu}(-))(\mathbf{x} - \boldsymbol{\mu}(-))^T, \tag{6}$$

where $\boldsymbol{\mu}$ is the mean vector over all training patterns and $X(+)$
and $X(-)$ are the training sets of class (+) and (-), respectively. This form of
Fisher's discriminant implies that $S$ is invertible; however, this is not the case
if the number of features $n$ is larger than the number of examples $\ell$ since the
rank of $S$ is at most $\ell$. The classifiers of Equations 4 and 6 are similar if the
scatter matrix is approximated by its diagonal elements. This approximation is
exact when the vectors formed by the values of one feature across all training
patterns are orthogonal, after subtracting the class mean. The approximation
retains some validity if the features are uncorrelated, that is, if the expected value
of the product of two different features is zero, after removing the class mean.
Approximating $S$ by its diagonal elements is one way of regularizing it (making it
invertible). However, features usually are correlated and, therefore, the diagonal
approximation is not valid.

One aspect of the present invention comprises using the feature ranking
coefficients as classifier weights. Reciprocally, the weights multiplying the
inputs of a given classifier can be used as feature ranking coefficients. The inputs
that are weighted by the largest values have the most influence in the
classification decision. Therefore, if the classifier performs well, those inputs
with largest weights correspond to the most informative features, or in this
instance, genes. Other methods, known as multivariate classifiers, comprise
algorithms to train linear discriminant functions that provide superior feature
ranking compared to correlation coefficients. Multivariate classifiers, such as the
Fisher's linear discriminant (a combination of multiple univariate classifiers) and
methods disclosed herein, are optimized during training to handle multiple
variables or features simultaneously.

For classification problems, the ideal objective function is the expected
value of the error, i.e., the error rate computed on an infinite number of examples.
For training purposes, this ideal objective is replaced by a cost function $J$
computed on training examples only. Such a cost function is usually a bound or
an approximation of the ideal objective, selected for convenience and efficiency.
For linear SVMs, the cost function is:

$$J = \tfrac{1}{2} \|\mathbf{w}\|^2, \tag{7}$$

which is minimized, under constraints, during training. The criterion $(w_i)^2$
estimates the effect on the objective (cost) function of removing feature $i$.

A good feature ranking criterion is not necessarily a good criterion for
ranking feature subsets. Some criteria estimate the effect on the objective
function of removing one feature at a time. These criteria become suboptimal
when several features are removed at one time, which is necessary to obtain a
small feature subset.

Recursive Feature Elimination (RFE) methods can be used to overcome
this problem. RFE methods comprise iteratively 1) training the classifier, 2)
computing the ranking criterion for all features, and 3) removing the feature
having the smallest ranking criterion. This iterative procedure is an example of
backward feature elimination. For computational reasons, it may be more
efficient to remove several features at a time at the expense of possible
classification performance degradation. In such a case, the method produces a
"feature subset ranking", as opposed to a "feature ranking". Feature subsets are
nested, e.g., $F_1 \subset F_2 \subset \ldots \subset F$.

If features are removed one at a time, this results in a corresponding
feature ranking. However, the features that are top ranked, i.e., eliminated last,
are not necessarily the ones that are individually most relevant. It may be the case
that the features of a subset are optimal in some sense only when taken in
some combination. RFE has no effect on correlation methods since the ranking
criterion is computed using information about a single feature.

In the present embodiment, the weights of a classifier are used to produce
a feature ranking with a SVM (Support Vector Machine). The present invention
contemplates methods of SVMs used for both linear and non-linear decision
boundaries of arbitrary complexity; however, the example provided herein is
directed to linear SVMs because of the nature of the data set under investigation.
Linear SVMs are particular linear discriminant classifiers. (See Equation 1). If
the training set is linearly separable, a linear SVM is a maximum margin
classifier. The decision boundary (a straight line in the case of a two-dimension
separation) is positioned to leave the largest possible margin on either side. One
quality of SVMs is that the weights $w_i$ of the decision function $D(\mathbf{x})$
are a function only of a small subset of the training examples, i.e., "support
vectors". Support vectors are the examples that are closest to the decision
boundary and lie on the margin. The existence of such support vectors is at the
origin of the computational properties of SVM and its competitive classification
performance. While SVMs base their decision function on the support vectors that
are the borderline cases, other methods such as the previously-described method of
Golub, et al., base the decision function on the average case.

A preferred method of the present invention comprises using a variant of
the soft-margin algorithm where training comprises executing a quadratic
program as described by Cortes and Vapnik in "Support vector networks", 1995,
Machine Learning, 20:3, 273-297, which is incorporated herein by reference in
its entirety. The following is provided as an example, however, different
programs are contemplated by the present invention and can be determined by
those skilled in the art for the particular data sets involved.

Inputs comprise training examples (vectors)
$\{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_k, \ldots, \mathbf{x}_\ell\}$ and
class labels $\{y_1, y_2, \ldots, y_k, \ldots, y_\ell\}$. To identify the optimal
hyperplane, the following quadratic program is executed:

Minimize over $\alpha_k$:

$$J = \tfrac{1}{2} \sum_{hk} y_h y_k \alpha_h \alpha_k \left(\mathbf{x}_h \cdot \mathbf{x}_k + \lambda \delta_{hk}\right) - \sum_k \alpha_k$$

subject to:

$$0 \le \alpha_k \le C \quad \text{and} \quad \sum_k \alpha_k y_k = 0 \tag{8}$$

with the resulting outputs being the parameters $\alpha_k$, where the summations run
over all training patterns $\mathbf{x}_k$ that are $n$ dimensional feature vectors,
$\mathbf{x}_h \cdot \mathbf{x}_k$ denotes the scalar product, $y_k$ encodes the class
label as a binary value +1 or -1, $\delta_{hk}$ is the Kronecker symbol
($\delta_{hk} = 1$ if $h = k$ and 0 otherwise), and $\lambda$ and $C$ are positive
constants (soft margin parameters). The soft margin parameters ensure
convergence even when the problem is non-linearly separable or poorly
conditioned. In such cases, some support vectors may not lie on the margin.
Methods include relying on $\lambda$ or $C$, but preferred methods, and those used in
the Examples below, use a small value of $\lambda$ (on the order of $10^{-14}$) to
ensure numerical stability. For the Examples provided herein, the solution is rather
insensitive to the value of $C$ because the training data sets are linearly separable
down to only a few features. A value of $C = 100$ is adequate, however, other
methods may use other values of $C$.

The resulting decision function of an input vector x is:

$$D(x) = w \cdot x + b$$

with

$$w = \sum_k \alpha_k y_k x_k \quad \text{and} \quad b = \langle y_k - w \cdot x_k \rangle \qquad (9)$$

The weight vector w is a linear combination of training patterns. Most weights $\alpha_k$
are zero. The training patterns with non-zero weights are support vectors. Those
having a weight that satisfies the strict inequality $0 < \alpha_k < C$ are marginal support
vectors. The bias value b is an average over marginal support vectors.
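As an illustration of the decision function, the following minimal Python sketch computes the weight vector and bias from a set of already-trained weights. It assumes the weights come from some external quadratic-program solver (not shown); the function names `weights_and_bias` and `decision` and the tolerance `tol` are hypothetical and are not part of the described method.

```python
from typing import List, Sequence, Tuple


def weights_and_bias(
    X: Sequence[Sequence[float]],
    y: Sequence[int],
    alpha: Sequence[float],
    C: float,
    tol: float = 1e-12,  # assumed numerical tolerance for the strict inequality
) -> Tuple[List[float], float]:
    """Return (w, b) for a linear SVM given trained weights alpha."""
    n = len(X[0])
    # w = sum_k alpha_k * y_k * x_k (a linear combination of training patterns)
    w = [sum(a * yk * xk[j] for a, yk, xk in zip(alpha, y, X)) for j in range(n)]
    # Marginal support vectors satisfy the strict inequality 0 < alpha_k < C.
    marginal = [k for k, a in enumerate(alpha) if tol < a < C - tol]
    if not marginal:
        raise ValueError("no marginal support vectors; bias cannot be averaged")
    # b is the average of (y_k - w . x_k) over the marginal support vectors.
    b = sum(
        y[k] - sum(wj * xj for wj, xj in zip(w, X[k])) for k in marginal
    ) / len(marginal)
    return w, b


def decision(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """D(x) = w . x + b"""
    return sum(wj * xj for wj, xj in zip(w, x)) + b
```

For two training points (1, 0) and (-1, 0) with labels +1 and -1 and weights of 0.5 each, this gives w = (1, 0) and b = 0, so the sign of D(x) follows the first coordinate.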

The following sequence illustrates application of recursive feature
elimination (RFE) to a SVM using the weight magnitude as the ranking criterion.

The inputs are training examples (vectors): $X_0 = [x_1, x_2, \ldots, x_k, \ldots, x_\ell]^T$ and class
labels $y = [y_1, y_2, \ldots, y_k, \ldots, y_\ell]^T$.

Initialize:
    Subset of surviving features
        s = [1, 2, ..., n]
    Features ranked list
        r = []

Repeat until s = []
    Restrict training examples to good feature indices
        X = X_0(:, s)
    Train the classifier
        α = SVM-train(X, y)
    Compute the weight vector of dimension length(s):
        $w = \sum_k \alpha_k y_k x_k$
    Compute the ranking criteria
        $c_i = (w_i)^2$ for all i
    Find the feature with smallest ranking criterion
        f = argmin(c)
    Update feature ranked list
        r = [s(f), r]
    Eliminate the feature with smallest ranking criterion
        s = s(1:f-1, f+1:length(s))

The output comprises feature ranked list r.
The above steps can be modified to increase computing speed by generalizing the
algorithm to remove more than one feature per step.
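The elimination loop above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the `train` argument is a hypothetical callable returning one weight per surviving feature, and the default `mean_difference_weights` is a simple stand-in (difference of class means), not an SVM. Because the stand-in scores each feature independently, the demonstration only shows the bookkeeping of s and r; with a real SVM trainer the weights would change after each elimination.

```python
from typing import Callable, List, Sequence

Matrix = List[List[float]]


def mean_difference_weights(X: Matrix, y: Sequence[int]) -> List[float]:
    """Stand-in trainer (NOT an SVM): w = mean(x | y=+1) - mean(x | y=-1)."""
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == -1]
    return [
        sum(r[j] for r in pos) / len(pos) - sum(r[j] for r in neg) / len(neg)
        for j in range(len(X[0]))
    ]


def rfe(
    X0: Matrix,
    y: Sequence[int],
    train: Callable[[Matrix, Sequence[int]], List[float]] = mean_difference_weights,
) -> List[int]:
    """Return feature indices (0-based) ranked from most to least relevant."""
    s = list(range(len(X0[0])))  # surviving features
    r: List[int] = []            # ranked list
    while s:
        X = [[row[j] for j in s] for row in X0]    # X = X0(:, s)
        w = train(X, y)                            # weight vector, length(s)
        c = [wi * wi for wi in w]                  # ranking criteria c_i = w_i^2
        f = min(range(len(c)), key=c.__getitem__)  # f = argmin(c)
        r.insert(0, s[f])                          # r = [s(f), r]
        del s[f]                                   # eliminate feature f
    return r
```

Removing several features per step would amount to deleting the k smallest-criterion entries of s in each pass instead of one.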

In general, RFE is computationally expensive when compared against
correlation methods, where several thousands of input data points can be ranked
in about one second using a Pentium® processor, and against methods that use
the weights of a classifier trained only once with all features, such as SVMs or
pseudo-inverse/mean squared error (MSE). A SVM implemented using
non-optimized MatLab® code on a Pentium® processor can provide a solution in a
few seconds. To increase computational speed, RFE is preferably implemented by
training multiple classifiers on subsets of features of decreasing size. Training
time scales linearly with the number of classifiers to be trained. The trade-off is
computational time versus accuracy. Use of RFE provides better feature selection
than can be obtained by using the weights of a single classifier. Better results are
also obtained by eliminating one feature at a time as opposed to eliminating
chunks of features. However, significant differences are seen only for a smaller
subset of features such as fewer than 100. Without trading accuracy for speed,
RFE can be used by removing chunks of features in the first few iterations and
then, in later iterations, removing one feature at a time once the feature set
reaches a few hundreds. RFE can be used when the number of features, e.g.,
genes, is increased to millions. In one example, at the first iteration, the number
of genes was reduced to the closest power of two. At subsequent iterations, half of
the remaining genes were eliminated, such that the number of genes was reduced
by a power of two at each iteration. Nested subsets of genes were obtained that
had increasing information density.
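The halving schedule in the example can be sketched as a short Python helper. This is an illustrative sketch only: it assumes that "closest power of two" means the largest power of two strictly below the starting feature count, and the function name `halving_schedule` is hypothetical.

```python
from typing import List


def halving_schedule(n_features: int) -> List[int]:
    """Sizes of the nested feature subsets left after each RFE iteration.

    Assumption: the first iteration trims to the largest power of two strictly
    below n_features; each later iteration removes half of what remains.
    """
    if n_features < 2:
        return []
    p = 1
    while p * 2 < n_features:
        p *= 2
    sizes = []
    while p >= 1:
        sizes.append(p)
        p //= 2
    return sizes
```

Starting from 7000 features, for instance, the subsets would have 4096, 2048, and so on down to 1 feature, so only 13 classifiers need to be trained rather than 7000.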

RFE consistently outperforms the naive ranking, particularly for small
feature subsets. (The naive ranking comprises ranking the features with $(w_i)^2$,
which is computationally equivalent to the first iteration of RFE.) The naive
ranking organizes features according to their individual relevance, while RFE
ranking is a feature subset ranking. The nested feature subsets contain
complementary features that individually are not necessarily the most relevant.
An important aspect of SVM feature selection is that clean data is most preferred
because outliers play an essential role. The selection of useful patterns, support
vectors, and selection of useful features are connected.

In addition to the above-described linear example, SVM-RFE can be used
in nonlinear cases and other kernel methods. The method of eliminating features
on the basis of the smallest change in cost function can be extended to nonlinear
uses and to all kernel methods in general. Computations can be made tractable by
assuming no change in the value of the α's. Thus, the classifier need not be
retrained for every candidate feature to be eliminated.

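Specifically, in the case of SVMs, the cost function to be minimized
(under the constraints $0 \le \alpha_k \le C$ and $\sum_k \alpha_k y_k = 0$) is:

$$J = \frac{1}{2} \alpha^T H \alpha - \alpha^T \mathbf{1} \qquad (10)$$

where H is the matrix with elements $y_h y_k K(x_h, x_k)$, K is a kernel function that
measures the similarity between $x_h$ and $x_k$, and $\mathbf{1}$ is an $\ell$ dimensional vector of
ones.

An example of such a kernel function is

$$K(x_h, x_k) = \exp(-\gamma \| x_h - x_k \|^2) \qquad (11)$$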
To compute the change in cost function caused by removing input
component i, one leaves the α's unchanged and recomputes matrix H. This
corresponds to computing $K(x_h(-i), x_k(-i))$, yielding matrix $H(-i)$, where the
notation (-i) means that component i has been removed. The resulting ranking
coefficient is:

$$DJ(i) = \frac{1}{2} \alpha^T H \alpha - \frac{1}{2} \alpha^T H(-i) \alpha \qquad (12)$$

The input corresponding to the smallest difference DJ(i) is then removed. The
procedure is repeated to carry out Recursive Feature Elimination (RFE).
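The ranking coefficient of Equation 12 can be sketched in Python as follows. This is a minimal sketch: a Gaussian kernel of the form exp(-γ‖a - b‖²) is assumed for illustration, the weights α are taken as already trained and held fixed, and the names `rbf_kernel`, `quadratic_form`, and `ranking_coefficients` are hypothetical.

```python
import math
from typing import List, Sequence


def rbf_kernel(a: Sequence[float], b: Sequence[float], gamma: float) -> float:
    """Assumed kernel for illustration: exp(-gamma * ||a - b||^2)."""
    return math.exp(-gamma * sum((p - q) ** 2 for p, q in zip(a, b)))


def quadratic_form(X: List[List[float]], y: Sequence[int],
                   alpha: Sequence[float], gamma: float) -> float:
    """alpha^T H alpha with H[h][k] = y_h * y_k * K(x_h, x_k)."""
    total = 0.0
    for h in range(len(X)):
        for k in range(len(X)):
            total += alpha[h] * alpha[k] * y[h] * y[k] * rbf_kernel(X[h], X[k], gamma)
    return total


def ranking_coefficients(X: List[List[float]], y: Sequence[int],
                         alpha: Sequence[float], gamma: float) -> List[float]:
    """DJ(i) for every input component i, leaving alpha unchanged."""
    full = quadratic_form(X, y, alpha, gamma)
    coefficients = []
    for i in range(len(X[0])):
        X_without_i = [row[:i] + row[i + 1:] for row in X]  # x(-i)
        reduced = quadratic_form(X_without_i, y, alpha, gamma)
        coefficients.append(0.5 * full - 0.5 * reduced)
    return coefficients
```

A component that takes the same value in every training pattern leaves the kernel unchanged when removed, so its DJ is zero and it is the first candidate for elimination.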

A method for predicting the optimum subset of data can comprise
defining a criterion of optimality that uses information derived from training
examples only. This criterion can be checked by determining whether a predicted
data subset performed best on the test set.

A criterion that is often used in similar "model selection" problems is the
leave-one-out success rate $V_{suc}$. In some cases, it may be of little use since
it does not allow differentiation between many classifiers that have zero
leave-one-out error. Such differentiation is obtained by using a criterion that
combines all of the quality metrics computed by cross-validation with the
leave-one-out method:

$$Q = V_{suc} + V_{acc} + V_{ext} + V_{med} \qquad (13)$$

where $V_{suc}$ is the success rate, $V_{acc}$ the acceptance rate, $V_{ext}$ the extremal margin,
and $V_{med}$ is the median margin.

Theoretical considerations suggest modification of this criterion to
penalize large gene sets. The probability of observing large differences between
the leave-one-out error and the test error increases with the size d of the data set,
according to

$$\epsilon(d) = \sqrt{-\log(\alpha) + \log(G(d))} \cdot \sqrt{p(1-p)/n} \qquad (14)$$

where (1-α) is the confidence (typically 95%, i.e., α = 0.05), p is the "true" error
rate (p <= 0.01), and n is the size of the training set.

Following the guaranteed risk principle, a quantity proportional to $\epsilon(d)$
was subtracted from criterion Q to obtain a new criterion:

$$C = Q - 2\epsilon(d) \qquad (15)$$

The coefficient of proportionality was computed heuristically, assuming
that $V_{suc}$, $V_{acc}$, $V_{ext}$ and $V_{med}$ are independent random variables with the same
error bar $\epsilon(d)$ and that this error bar is commensurate to a standard deviation. In
this case, variances would be additive, therefore, the error bar should be
multiplied by sqrt(4).
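The factor of 2 in Equation 15 can be traced in one step. Under the stated heuristic assumptions (four independent terms, each with standard deviation ε(d)), the variance of their sum Q is:

```latex
\operatorname{Var}(Q)
  = \operatorname{Var}(V_{suc}) + \operatorname{Var}(V_{acc})
  + \operatorname{Var}(V_{ext}) + \operatorname{Var}(V_{med})
  = 4\,\epsilon(d)^2
\quad\Longrightarrow\quad
\sigma_Q = \sqrt{4}\,\epsilon(d) = 2\,\epsilon(d)
```

so subtracting one error bar of Q corresponds to subtracting 2ε(d).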

The leave-one-out method with the classifier quality criterion can be used
to estimate the optimum number of data points. The leave-one-out method
comprises taking out one example of the training set. Training is then performed
on the remaining examples, with the left out example being used to test the
trained classifier. This procedure is iterated over all the examples. Each criterion
is computed as an average over all examples. The overall classifier quality
criterion is calculated according to Equation 13. The classifier is a linear
classifier with hard margin.
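The leave-one-out procedure can be sketched in Python as below. This is a hedged sketch that computes only the success rate; the `predict` callable is hypothetical, and the demonstration uses a nearest-class-mean rule on one-dimensional inputs as a stand-in, not the hard-margin linear SVM described in the text.

```python
from typing import Callable, List, Sequence


def nearest_mean_label(X_train: Sequence[float], y_train: Sequence[int], x: float) -> int:
    """Stand-in classifier (NOT an SVM): label of the closer class mean."""
    pos = [v for v, label in zip(X_train, y_train) if label == 1]
    neg = [v for v, label in zip(X_train, y_train) if label == -1]
    mu_pos = sum(pos) / len(pos)
    mu_neg = sum(neg) / len(neg)
    return 1 if abs(x - mu_pos) <= abs(x - mu_neg) else -1


def leave_one_out_success(
    X: List[float],
    y: List[int],
    predict: Callable[[Sequence[float], Sequence[int], float], int] = nearest_mean_label,
) -> float:
    """Fraction of examples classified correctly when each is left out in turn."""
    hits = 0
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]  # take out example i
        y_train = y[:i] + y[i + 1:]
        hits += int(predict(X_train, y_train, X[i]) == y[i])  # test on the left-out example
    return hits / len(X)
```

The other quality metrics (acceptance rate, extremal margin, median margin) would be accumulated in the same loop and averaged in the same way.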

Results of the SVM-RFE as taught herein show that, at the optimum
predicted by the method using training data only, the leave-one-out error is zero
and the test performance is actually optimum.

According to the present invention, a number of different methods are
provided for selection of features in order to train a learning machine using data
that best represents the essential information to be extracted from the data set.
The inventive methods provide advantages over prior art feature selection
methods by taking into account the interrelatedness of the data, e.g., multi-label
problems. The feature selection can be performed as a pre-processing step, prior
to training the learning machine, and can be done in either input space or feature
space.

Data Cleaning: Data cleaning is the problem of identifying mislabeled or
meaningless data points. Removal or correction of mislabeled or meaningless
data can improve the performance of the classifier, and in the case of feature
selection algorithms, leads to more meaningful features. In the case of removal,
an automatic data cleaning operation can be performed to remove bad features.
Where correction is desired, some form of intervention is required to contact the
source of the data (the "data collector") to confirm that the data that was received
was properly transmitted, or to prompt the provider to review and possibly repeat
the test to generate replacement data.

A number of algorithms may be used to identify mislabeled patterns. In
the preferred embodiment, the algorithms are of the following form: they are
given as input a training set $S_\ell$ and they output a rank $r_i$ for each of the $\ell$ data
points. The rank indicates the likelihood of being an outlier, with greater
likelihood being assigned a lower rank.

Measuring the success of a data cleaning algorithm, even when there is
only a single mislabeled point, is not obvious. Taking the mean rank of the
(single) mislabeled points is an obvious measure, however, it can pay too much
attention to a few large scores. For example, if one algorithm predicts the
mislabeled points on 9 out of 10 runs with rank 1, but on the other run assigns a
rank of 50 to the mislabeled point, it obtains a mean of 5.9. Another algorithm
that always gives a rank of 5 obtains a better mean of 5, yet the first algorithm is
more useful for detecting outliers. If one is referring only the first one or two
points to the data collector for verification, then the first algorithm will find
outliers whereas the second will not.

The following measurements of error are recorded over n runs: 1)
number of mislabeled points with rank 1, 2 and 3 (three separate scores); 2) mean
rank of the mislabeled points, $\frac{1}{n} \sum_{i=1}^{n} r_{p_i}$, where $p_i$ is the index of the mislabeled
point in run i; 3) trimmed mean, $\frac{1}{n} \sum_{i=1}^{n} \min(c, r_{p_i})$, where $p_i$ is the index of the
mislabeled point in run i. That is, one takes the mean rank of the mislabeled
points, but for ranks greater than c, their contribution to the mean is decreased.
This serves to concentrate the score on the first few rankings.
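The mean and trimmed-mean scores can be sketched in a few lines of Python. The function names are hypothetical, and the cutoff c = 10 used in the note after the code is chosen only for illustration; the text does not state which value of c was used.

```python
from typing import Sequence


def mean_rank(ranks: Sequence[int]) -> float:
    """Mean rank of the mislabeled point over all runs."""
    return sum(ranks) / len(ranks)


def trimmed_mean_rank(ranks: Sequence[int], c: int) -> float:
    """Mean rank in which any rank greater than c contributes only c."""
    return sum(min(c, r) for r in ranks) / len(ranks)
```

For an algorithm that ranks the mislabeled point first in 9 of 10 runs and 50th in the remaining run, the mean rank is 5.9 but the trimmed mean with c = 10 is 1.9. An algorithm that always assigns rank 5 scores 5.0 on both measures, so the trimmed mean favors the first algorithm.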

The test set-up used was as follows: In toy data, 100 training sets of fixed
size were randomly drawn. The label of a single training example in each set was
flipped to create the mislabeled point. The data cleaning algorithms were scored
according to the ranking of the mislabeled example in each set, then the mean
score was taken. In real datasets that are of small size, e.g., microarray data,
independent training sets cannot be randomly drawn. In this case, the label of
each example is flipped in turn so that one has $\ell$ copies of the original training
set, each having a single mislabeled point.

The following algorithms were compared: SVM-ξ, which records the
distance from the "correct" side of the margin ξ of each training point; SVM-α,
which records the size of the weight $\alpha_i$ of each training point; SUB-ERR, which
sub-samples the data many times, each time training a SVM and recording if each
test point is mislabeled by the algorithm or not; SUB-α, which sub-samples many
times and records the weight of each training point; SUB-ξ, which sub-samples
many times and records the distance from the correct side of the margin; LOO-ξ,
which performs leave-one-out, each time training a SVM and recording the
distance from the correct side of the margin; LOO-W², which performs
leave-one-out, each time training a SVM and recording the size of the margin,
which is inversely proportional to W²; FLIP-W², which flips the label of each
example, each time training a SVM and recording the size of the margin; and
GRAD-C, which assigns a variable $C_i$ for each training point and minimizes R²W²
using the "ridge trick".

The SVM-ξ algorithm assigns the ranking by the distance of each point
from the correct side of the margin, with the largest ranked first according to the
following:

1. choose a value of the soft margin parameter C;
2. train the classifier $[w, b] = \mathrm{SVM}(S_\ell, C)$;
3. calculate $\xi_i = 1 - y_i (w \cdot x_i + b)$ for all i;
4. assign $r_i = \mathrm{card}\{\xi_j : \xi_j \ge \xi_i\}$ for all i.
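Steps 3 and 4 can be sketched in Python as follows, assuming a linear classifier (w, b) has already been trained by some external SVM routine (not shown). The function name `slack_ranks` is hypothetical, and ties are resolved with a non-strict comparison so that the largest distance receives rank 1.

```python
from typing import List, Sequence


def slack_ranks(
    X: Sequence[Sequence[float]],
    y: Sequence[int],
    w: Sequence[float],
    b: float,
) -> List[int]:
    """Rank points by xi_i = 1 - y_i (w . x_i + b); the largest xi gets rank 1."""
    xi = [
        1 - yi * (sum(wj * xj for wj, xj in zip(w, x)) + b)
        for x, yi in zip(X, y)
    ]
    # r_i = number of points whose xi is at least as large as xi_i
    return [sum(1 for other in xi if other >= value) for value in xi]
```

A point lying far on the wrong side of the margin has a large ξ and therefore a low rank, which marks it as a likely outlier.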

The SVM-α algorithm uses weights where, if a point is easy to classify,
its weight is zero. The more unusual the point, the larger the weight. The
ranking is thus given by the weight of each point, with the largest ranked first,
according to the steps:

1. choose a value of the soft margin parameter C;
2. train the classifier $[\alpha, b] = \text{SVM-DUAL}(S_\ell, C)$;
3. assign $r_i = \mathrm{card}\{\alpha_j : \alpha_j \ge \alpha_i\}$ for all i.

The SUB-ERR algorithm assigns ranking according to the average
number of mislabelings, with the most mislabeled ranked first according to the
sequence:

1. choose a value of the soft margin parameter C, the number of sub-sample
   runs p and the sub-sampling size q;
2. initialize $e_i = 0$ and $u_i = 0$ for all i;
3. FOR i = 1 TO p runs
   - draw a random sub-sample of the training data of size q with indexes
     $tr_1, \ldots, tr_q$;
   - let the remainder of the data have indexes $tst_1, \ldots, tst_{\ell-q}$;
   - train a classifier $[w, b] = \mathrm{SVM}(\{(x_{tr_j}, y_{tr_j})\}_{j=1,\ldots,q}, C)$;
   - assign $u_{tst_j} = u_{tst_j} + 1$ for all j;
   - assign $e_{tst_j} = e_{tst_j} - \mathrm{sign}(w \cdot x_{tst_j} + b)\, y_{tst_j}$ for all j;
4. assign $r_i = \mathrm{card}\{e_j/u_j : e_j/u_j \ge e_i/u_i\}$ for all i.
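The sub-sampling scheme can be sketched in Python as below. Several things are assumptions made for illustration: a nearest-class-mean rule on one-dimensional inputs stands in for the SVM, each mislabeling of a held-out point is counted as +1 (so e/u is an error rate), and the names `nearest_mean_label`, `sub_err_ranks`, and `seed` are hypothetical.

```python
import random
from typing import List, Sequence


def nearest_mean_label(X_train: Sequence[float], y_train: Sequence[int], x: float) -> int:
    """Stand-in classifier (NOT an SVM): label of the closer class mean."""
    pos = [v for v, label in zip(X_train, y_train) if label == 1]
    neg = [v for v, label in zip(X_train, y_train) if label == -1]
    mu_pos = sum(pos) / len(pos)
    mu_neg = sum(neg) / len(neg)
    return 1 if abs(x - mu_pos) <= abs(x - mu_neg) else -1


def sub_err_ranks(X: Sequence[float], y: Sequence[int],
                  p: int, q: int, seed: int = 0) -> List[int]:
    """Rank points by how often they are mislabeled when held out (rank 1 = most)."""
    rng = random.Random(seed)
    n = len(X)
    e = [0] * n  # e_i: times point i was mislabeled while held out
    u = [0] * n  # u_i: times point i was held out
    for _ in range(p):
        train = sorted(rng.sample(range(n), q))          # sub-sample of size q
        held_out = [i for i in range(n) if i not in train]
        X_train = [X[i] for i in train]
        y_train = [y[i] for i in train]
        if len(set(y_train)) < 2:
            continue  # the stand-in needs both classes in the sub-sample
        for i in held_out:
            u[i] += 1
            if nearest_mean_label(X_train, y_train, X[i]) != y[i]:
                e[i] += 1
    score = [e[i] / u[i] if u[i] else 0.0 for i in range(n)]
    return [sum(1 for s in score if s >= score[i]) for i in range(n)]
```

With two well-separated clusters and one flipped label, the flipped point is mislabeled every time it is held out, so it receives rank 1 while all other points tie for last.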

Using the SUB-α algorithm, the ranking is given by the average weight of
each point, with the largest value being ranked first, using the following:

1. choose a value of the soft margin parameter C, the number of sub-sample
   runs p and the sub-sampling size q;
2. initialize $e_i = 0$ and $u_i = 0$ for all i;
3. FOR i = 1 TO p runs
   - draw a random sub-sample of the training data of size q with indexes
     $tr_1, \ldots, tr_q$;
   - let the remainder of the data have indexes $tst_1, \ldots, tst_{\ell-q}$;
   - train a classifier $[\alpha, b] = \text{SVM-DUAL}(\{(x_{tr_j}, y_{tr_j})\}_{j=1,\ldots,q}, C)$;
   - assign $u_{tr_j} = u_{tr_j} + 1$ for all j;
   - assign $e_{tr_j} = e_{tr_j} + \alpha_j$ for all j;
4. assign $r_i = \mathrm{card}\{e_j/u_j : e_j/u_j \ge e_i/u_i\}$ for all i.

The SUB-ξ algorithm assigns ranking by the average distance of each
point from the "correct" side of the margin, with the largest distance ranking first,
according to the following:

1. choose a value of the soft margin parameter C, the number of sub-sample
   runs p and the sub-sampling size q;
2. initialize $e_i = 0$ and $u_i = 0$ for all i;
3. FOR i = 1 TO p runs
   - draw a random sub-sample of the training data of size q with indexes
     $tr_1, \ldots, tr_q$;
   - let the remainder of the data have indexes $tst_1, \ldots, tst_{\ell-q}$;
   - train a classifier $[w, b] = \mathrm{SVM}(\{(x_{tr_j}, y_{tr_j})\}_{j=1,\ldots,q}, C)$;
   - assign $u_{tst_j} = u_{tst_j} + 1$ for all j;
   - assign $e_{tst_j} = e_{tst_j} + 1 - y_{tst_j}(w \cdot x_{tst_j} + b)$ for all j;
4. assign $r_i = \mathrm{card}\{e_j/u_j : e_j/u_j \ge e_i/u_i\}$ for all i.

The LOO-ξ algorithm assigns rank by the distance of each left out point
from the "correct" side of the margin, with the largest ranked first according to
the sequence:

1. choose a value of the soft margin parameter C;
2. FOR i = 1 TO $\ell$
   - train the classifier $[w, b] = \mathrm{SVM}(S_\ell \setminus \{(x_i, y_i)\}, C)$;
   - assign $e_i = 1 - y_i(w \cdot x_i + b)$;
3. assign $r_i = \mathrm{card}\{e_j : e_j \ge e_i\}$ for all i.



WO 02/101954 PCT/US02/19202



The LOO-W² algorithm ranks according to the size of W² when leaving
out each point, with the largest ranked first according to the steps:

1. choose a value of the soft margin parameter C;
2. FOR i = 1 TO $\ell$
   - train the classifier $[\alpha, b] = \text{SVM-DUAL}(S_\ell \setminus \{(x_i, y_i)\}, C)$;
   - assign $e_i = W^2(\alpha)$;
3. assign $r_i = \mathrm{card}\{e_j : e_j \ge e_i\}$ for all i.

The FLIP-W² algorithm ranks according to the size of W² when flipping
each point, with the largest ranked first, following the sequence:

1. choose a value of the soft margin parameter C;
2. FOR i = 1 TO $\ell$
   - train the classifier $[\alpha, b] = \text{SVM-DUAL}(\{S_\ell \setminus \{(x_i, y_i)\}\} \cup \{(x_i, -y_i)\}, C)$;
   - assign $e_i = W^2(\alpha)$;
3. assign $r_i = \mathrm{card}\{e_j : e_j \le e_i\}$ for all i.

There are three other variants of the "FLIP" algorithm: FLIP-SPAN uses
the span [2] as the quality measure; FLIP-DIST uses the change in distance from
the margin of the flipped point; FLIP-VALID uses a validation set instead,
summing the errors using a sigmoid on the distance from the margin of type
1/(1 + exp(3x)).

The GRAD-C algorithm minimizes

sapR\QW^(a,C) 

a 

where W\a) = J^a, ^^^^a^ajy^yjix, ^x.^l^S^) 
subject to 0 < a,,]^«,.y, = 0 

and if(C) = )l[xf -'jh^^i, 

which can be solved by gradient descent. One then assigns
$r_i = \mathrm{card}\{C_j : C_j \le C_i\}$ for all i.

Data cleaning algorithms can be improved by use in conjunction with
feature selection techniques. One scheme is to implement a feature selection
algorithm directly into the classifiers. A second is to consider different feature
selection subset sizes and consider the scores across all the subset sizes.

The problem used to compare the various data cleaning algorithms
consists of two Gaussian distributions in d = 100 dimensions with variance
1 3/sqrt(d), one for each class label, drawing 30 training points randomly for each
class. One hundred of the datasets were generated with the thirtieth data point
flipped. The problem is to detect that this thirtieth data point is mislabeled. The
various algorithms were compared in both hard margin and soft margin settings.
The results are provided in Table 1, where the letters after the algorithm identifier
imply the following: C is the value of the soft margin chosen using a validation
set; 2C means two times this value (as sub-sampling involves a smaller training
set). No C means a hard margin algorithm. In the sub-sampling algorithms,
"300" after the name indicates the number of sub-samples used.

FLIP-VALID C    4.39    38    58    72    338     2.56
GRAD-C          4.89    38    49    63    3.86    2.81

Table 1



According to the results shown in Table 1, SVM-α is not one of the better
algorithms to use. SVM-ξ is simple and obvious but is, nonetheless, a reasonable
algorithm. It is important to select a good value for C. Sub-sampling is generally
very good if the number of samples is large enough, with the error rate decreasing
with an increasing number of samples. Even sub-sampling is sensitive to the
choice of C. However, sub-sampling still performs well as a data cleaning
algorithm in the soft margin case and appears to be the best algorithm for this
purpose.

The LOO algorithm performed fairly well, but not as well as sub-sampling.
The flipping algorithms were not as good as expected, however, FLIP-DIST
performed well. FLIP-VALID actually provides the best result, however, it
uses extra information in the validation set. Perhaps using a validation set can be
approximated by using sub-sampling, but this slows down the algorithm. GRAD-C
provided good results, possibly due to the fact that the classifier found good
values of C as part of the algorithm.

Information Retrieval: According to the present invention, the relevance of
the output of the data analysis, e.g., the identity of the genes identified as good
predictors, is checked by comparison with documents retrieved using a different
module. In the exemplary embodiment for application to bioinformatics, and
particularly to analysis of gene expression data, the module for obtaining
information from published literature is identified as the Gene Search
Assistant™, or "GSA". The GSA is an online tool to assist in the analysis and
verification of gene-disease-organ relationships which is driven by an in-house
database, or "knowledge base", containing dynamically evolving analyst data and
static "library" data on specific genes, diseases and organs. The analyst data
includes references to online information sources, summaries and assessments of
the referenced online sources, and the analysts' evolving determinations
regarding specific gene-disease-organ triplets. For purpose of this description,
such determinations are referred to as "Findings", which are described below. An
analyst creates and modifies a finding using an interactive Web-based interface.
A finding is composed of several fields including a bibliographic reference to
online information. The referenced online information may be a web-based
article, paper, bibliographic entry, database record, etc. An analyst creates and
enters a bibliographic reference into the knowledge base using an interactive
Web-based interface. A bibliographic reference is composed of several fields that
are described in detail below.

For purposes of the following description, a "database server" is a
program that stores and retrieves information on a resident database in response
to commands and requests from users. A "Web server" is defined as a program
that displays or "serves" web pages in response to requests by web browsers (e.g.,
Netscape, Internet Explorer) and other web "clients".

The following description of a Finding is the logical construct that a user
sees when interacting with the application interface. This differs from the actual
database design used to implement this construct. With this distinction in mind, a
Finding, which has a unique identifier for each Finding, references the following
information:

1. Gene Accession Number(s) (GANs): Identifier(s) associated with the gene
   under investigation. A Finding may optionally contain an additional field
   that will hold GANs of related but not identical genes.

2. Gene description: The gene description is not associated with any analyst,
   but is modifiable by any analyst. If the description is changed, then the
   resulting change will be seen by all analysts. In the full featured version, a
   description will be "owned by" (associated with) either an analyst or the
   library, and only the owner of the description will be able to modify it. Each
   analyst will by default see only his or her description.

3. Gene keywords: In the demo, gene keywords are not associated with any
   analyst, but are modifiable by any analyst. If the keyword list is changed by
   an analyst, then the resulting change is seen by all analysts. In the future
   full-featured version, a keyword list will be "owned by" (associated with)
   either an analyst or the library, and only the owner of a keyword list will be
   able to modify it. Each analyst will by default see only his or her keyword
   list.

4. Disease name: The name of the disease associated with the Finding. A
   disease may have more than one name associated with it.

5. Disease description: Same comment as for gene description.

6. Disease keywords: Same comment as for gene keywords.

7. Organ name: Same comment as for disease name.

8. Organ description: Same comment as for gene and disease description.

9. Organ keywords: Same comment as for gene and disease keywords.

10. Analyst's summary: The analyst's evolving summary of the Finding.

11. Date/Time created: Date the Finding was initially created.

12. Date/Time last updated: Date the Finding was last updated.



The bibliographic reference is a record that references an online document
which the analyst believes is germane to a Finding. The reference contains
multiple fields that display the location of the document, the analyst's
evaluation of the document, and the online search-path which led the analyst
to the document. A bibliographic reference contains the following fields
created from data entered by the analyst:

1. BibRef ID: Unique ID associated with the bibliographic reference.

2. Resource: The URL (Web address) of the information resource that was
   used to search for this bibref (example: http://www.yahoo.com).

3. Query: The search string used to search the above information resource.

4. Doc URL: The URL of the document that is being referenced.

5. Local Copy: A local copy of the document is saved in an archive
   directory when a bibliographic reference is created.

6. Pub Date: The real or estimated publication date of the document.

7. Reliability: The reliability of the document as estimated by the analyst.

8. Relevance: The relevance of the document as estimated by the analyst.

9. Novelty: The novelty of the document as estimated by the analyst.

10. Finding ID: The identifier of the Finding that was active when the
    bibliographic reference was created.

11. GAN(s): The gene accession number(s) of the gene referenced by the
    associated Finding.

12. Gene: A unique identifier associated with the gene referenced by the
    associated Finding.

13. Disease: The disease referenced by the associated Finding.

14. Organ: The organ referenced by the associated Finding.

15. Analyst: The analyst who created the bibliographic reference and the
    associated Finding.

16. Summary: The analyst's summary of this bibliographic reference.
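The Finding and bibliographic reference records described above might be modeled as follows. This is an illustrative sketch of the logical construct only, not the database design actually used: the class names, attribute names, and types are hypothetical, and only a subset of the listed fields is shown.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class Finding:
    finding_id: int                    # unique identifier of the Finding
    gene_accession_numbers: List[str]  # GAN(s) of the gene under investigation
    disease_name: str
    organ_name: str
    gene_description: str = ""
    gene_keywords: List[str] = field(default_factory=list)
    analyst_summary: str = ""
    created: datetime = field(default_factory=datetime.now)
    last_updated: datetime = field(default_factory=datetime.now)


@dataclass
class BibRef:
    bibref_id: int                     # unique ID of the bibliographic reference
    finding_id: int                    # Finding active when the reference was created
    resource_url: str                  # information resource that was searched
    query: str                         # search string used on that resource
    doc_url: str                       # URL of the referenced document
    reliability: Optional[int] = None  # analyst's estimates (the scale is an assumption)
    relevance: Optional[int] = None
    novelty: Optional[int] = None
    summary: str = ""


def bibrefs_for(finding: Finding, bibrefs: List[BibRef]) -> List[BibRef]:
    """All bibliographic references associated with one Finding."""
    return [b for b in bibrefs if b.finding_id == finding.finding_id]
```

Linking each BibRef to its Finding by identifier mirrors the "Finding ID" field of the bibliographic reference.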

In the preferred embodiment, the Gene Search Assistant is designed
to provide controlled access to the knowledge base. Access is controlled via
a list of analysts with passwords and login names.

The Gene Search Assistant uses a "session" paradigm to track analyst
activity. When an analyst logs in to the GSA, that analyst is assigned a
unique and automatically generated session number that is terminated upon
the analyst logging out of the application. A session keeps track of the
analyst's login and logout time. The session information is stored in the
database and tracked via a "cookie" written into the analyst's browser cookie
file. This cookie also contains the analyst's ID number and the ID of the
current Finding.
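As a rough illustration of the session cookie, the sketch below builds and parses a cookie holding the three values mentioned above, using Python's standard `http.cookies` module. The cookie field names (`session`, `analyst_id`, `finding_id`) and the helper names are assumptions made for this example; the actual cookie layout is not specified in the text.

```python
from http.cookies import SimpleCookie
from typing import Dict


def make_session_cookie(session_number: int, analyst_id: int, finding_id: int) -> str:
    """Build a cookie string carrying the session number, analyst ID and Finding ID."""
    cookie = SimpleCookie()
    cookie["session"] = str(session_number)
    cookie["analyst_id"] = str(analyst_id)
    cookie["finding_id"] = str(finding_id)
    # Join the name=value pairs in the form a browser sends them back.
    return "; ".join(f"{name}={morsel.value}" for name, morsel in cookie.items())


def read_session_cookie(header_value: str) -> Dict[str, int]:
    """Parse the cookie string back into integer fields."""
    cookie = SimpleCookie()
    cookie.load(header_value)
    return {name: int(morsel.value) for name, morsel in cookie.items()}
```

On each request the server side would read these values to look up the stored session record, including its login time.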

The Gene Search Assistant is displayed on a single web page that is
broken into three separate areas called "frames". Referring to FIG. 9 as an
example, the top frame 702 displays links to the different functional sections
of the Gene Search Explorer. The central frame 704 displays the currently
selected functional section. The left frame 706, called the "navigation bar",
displays links to resources on the Web external to the site on which the GSA
is operating.

The top frame 702 contains links to the primary functional sections of
the GSA. When a user selects a link, the chosen section is displayed in the
central frame. The primary functional sections are:

1. Home: An introductory page, which also contains a login screen.

2. Register: Registration page for analysts.

3. Define Search: Allows analysts to specify the gene-disease-organ
   triplet which will be the subject of the current Finding.

4. View Findings: Displays an analyst's current finding, and all
   bibliographic references associated with that finding in a top and
   bottom frame.

5. Search Kbase: Contains a suite of options for searching the knowledge
   base. The database may be searched for Findings and bibliographic
   references.

6. Log Out: Logout page. Allows the analyst to log out, then displays the
   login and logout times for the session.

When the user clicks on a link to a resource listed in the navigation bar
706, a separate web page is launched which retrieves and displays the referenced
resource. The resources displayed in the navigation bar 706 are currently divided
into three categories:

1. General Search Engines and Resources 710:

• Ixquick: a "meta" search engine that retrieves, combines and
  organizes data retrieved through a comprehensive set of on-line
  search engines

• Google: another meta search engine - uses a different link-based
  ranking strategy.

• MSN Search: Microsoft Network's meta search engine

• AltaVista: A powerful general search engine

• Goto: A powerful general search engine with a different search
  strategy than AltaVista

• Dejanews: A search engine that retrieves articles posted to Usenet
  news groups

Note that although AltaVista and Goto are among those search engines
used by Ixquick, Google, and MSN Search to generate search results, all search
engines use different ranking strategies. The
preceding list is not intended to be exhaustive, and other search engines are or
may become available that are suitable for the purposes of the Gene Search
Assistant or similar literature search module.

2. Online Genetic and Proteomic Database Resources 712:

• Entrez Nucleotide Database: A searchable collection of nucleotide
  entries from GenBank.

• Entrez Protein Database: The protein entries in the Entrez search
  and retrieval system have been compiled from a variety of sources,
  including SwissProt, PIR, PRF, PDB, and translations from
  annotated coding regions in GenBank and RefSeq.

• Omim (Online Mendelian Inheritance in Man): This database is a
  catalog of human genes and genetic disorders authored and edited
  by Dr. Victor A. McKusick and his colleagues at Johns Hopkins and
  elsewhere, and developed for the World Wide Web by NCBI, the
  National Center for Biotechnology Information.

• SRS6: SRS is a program developed at the European Bioinformatics
  Institute for the indexing and cross-referencing of databases of
  textual information. It provides unified access to molecular biology
  databases, integration of analysis tools and advanced parsing tools
  for disseminating and reformatting information stored in ASCII
  text.

• ExPASy: The ExPASy (Expert Protein Analysis System)
  proteomics server of the Swiss Institute of Bioinformatics (SIB),
  dedicated to the analysis of protein sequences and structures as well
  as 2-D PAGE.

• DBCat: The public catalog of databases.

3. Online Journals 714:

• Entrez Pubmed: The National Institute of Health's online portal to
  Medline, PreMedline, and other related databases of journal articles
  and abstracts.

• HighWire: Stanford's search engine for searching across a wide
  cross section of peer-reviewed online journal sites.

• BMJ: medical journal search engine at the British Medical Journal
  similar to HighWire.

The central frame 704 displays the cxirrent functional section selected by 
the usei/analyst. The home/introductory page is initially displayed by default 
1 0 Following is a description of each functional section: 

1 . Home/hitroduction: The introductory page, which spears by default 
when the application is entered. This page contains a description of the 
^plication, and a login form. 
15 2. Register: Analysts who are pre-listed in a drop-menu on this page may 
register to use the application witii a chosen login name. A random 
password will be automatically generated and mailed to Ae analyst's 
email address. 

3. Define Search: Allows analysts to specify the gene-disease-organ triplet
which will be the subject of the current Finding. If a finding by the
analyst regarding the specified triplet already exists, it will be displayed
along with all associated bibliographic references. If a finding
corresponding to the specified triplet does not exist, one will be created.

4. View Findings: Referring to FIG. 10, the central frame is divided into
two frames, a top frame 902 in which information about the current
Finding is displayed, and a bottom frame 904 within which information
about all bibliographic references created for that finding is displayed.

a.) Top Frame 902: Within the top "Finding" frame, button 906 is
provided which, when pressed, displays a form which the
analyst can use to modify the current Finding. A second
button 908 within this frame refreshes the view. After
modifying the Finding, if the analyst presses second button
908, the modified Finding will be displayed in the top frame
902.

b.) Bottom Frame 904: Within the bottom "BibRefs" frame is
button 910 which, when pressed, displays a form which the
analyst can use to create a new BibRef. This new BibRef will
be associated with the current Finding. A second button 912
within this frame refreshes the view. After entering a BibRef,
if the analyst presses this second button, the new BibRef will
be displayed in the bottom frame 904, along with all other
BibRefs associated with the current Finding.

5. View Kbase: This section contains a suite of options for searching the
knowledge base. The analyst can search the knowledge base for either
Bibliographic References or Findings by gene, disease, organ, keywords,
analyst, or date. See FIG. 9.

The knowledge base may be searched for Findings or BibRefs. The
resulting records are displayed in the main viewing area of the browser. If
an analyst searches for findings, a listing of finding summaries is
displayed, each summary containing a link which when pressed displays
the full finding.

6. Log Out: This section contains a logout button which allows the analyst
to end the current session. Upon logout, an information screen appears
displaying the login and logout time of the terminated session.

The interactive functionality of the Gene Search Assistant is provided by a
web server, a database server, and CGI scripts (Common Gateway Interface
programs) which allow interaction between the web pages displayed on the
analyst's web browser, and the web and database servers. In a test embodiment
of the Gene Search Assistant, the web server that was used was Apache, and the
database server: MSQL. Both servers are open code freeware when run in a
UNIX environment. The CGI scripts for the demo were written in Perl5, a
programming/scripting language.

Visualization

The system of the present invention provides aids for visualization
of both input and output data. In an exemplary embodiment, scores and
coefficients, e.g., values of gene expression coefficients, can be visualized by
associating a color with the score values. Ranked lists of features can be
visualized by printing the feature identifiers in the order of the ranked list. The
identifiers can then be colored according to the scores, where each color is
associated with a value according to the color map or key. Ranked lists of
subsets of features can be represented in the same way, with the feature
identifiers being replaced by the identifiers of all features of the subset.
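
As a rough illustration of the score-to-color association described above, the following Python sketch bins scores linearly into a fixed number of colors and prints the identifiers in ranked order. The function names, the choice of eight colors, the linear binning, and the sample identifiers and scores are all assumptions made for illustration; the text does not specify how the color map is constructed.

```python
def score_to_color_index(score, lo, hi, n_colors=8):
    """Map a score in [lo, hi] to a color index in 1..n_colors (linear bins)."""
    if hi == lo:
        return 1
    frac = (score - lo) / (hi - lo)
    # Clamp so that score == hi falls in the last bin rather than overflowing.
    return min(n_colors, int(frac * n_colors) + 1)


def color_ranked_list(ranked):
    """ranked: list of (identifier, score) pairs, already in ranked order."""
    scores = [s for _, s in ranked]
    lo, hi = min(scores), max(scores)
    return [(name, score_to_color_index(s, lo, hi)) for name, s in ranked]


# Hypothetical feature identifiers and scores, best first.
ranked = [("gene_A", 0.90), ("gene_B", 0.55), ("gene_C", 0.10)]
colored = color_ranked_list(ranked)
for name, color in colored:
    print(name, "-> color", color)
```

A display layer would then look up each color index in the color map or key to render the identifier or its background.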

Ranked lists of features can also be visualized as a matrix of colored
coefficients. The columns of the matrix represent all of the values a given feature
takes across all patterns. The columns are ordered according to the feature
ranking. The rows of the matrix may be ordered, for example, to group the
examples of the same class together. A matrix can be transposed. One can also
represent ranked lists of feature subsets, particularly equivalent features, in this
way.

Nested subsets of features with cardinality increments of one can be
visualized by printing the feature identifiers in the order that they are added to
increase the cardinality of the feature subsets. The identifiers, or their
background, can then be optionally colored according to the score of the subset
containing all the features from the beginning of the list to that feature. For
example, define an eight color map of colors 1-8, shown in FIG. 11a, where the
different fill patterns indicate different colors. Assume that five features
$\{f_1, f_2, f_3, f_4, f_5\}$ form nested subsets $\{f_1\} \subset \{f_1, f_2\} \subset
\{f_1, f_2, f_3\} \subset \{f_1, f_2, f_3, f_4\} \subset \{f_1, f_2, f_3, f_4, f_5\}$
with scores (1, 2, 4, 5, 8). Using an elimination order, the nested subsets
can be represented by the combination of colors shown in FIG. 11b, where feature
$f_1$, a singleton, is indicated by a box filled with color 1 (illustrated as low density
dots) to indicate the lowest score, and feature $f_5$ is indicated by a box filled
with color 8 (illustrated as grid lines) to indicate the highest score.
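
The five-feature example can be sketched in Python as follows. The function name and the list-of-pairs output are assumptions made for illustration, and the subset scores are used directly as indices into the eight color map, which is a simplification of whatever rendering the figures actually use.

```python
def color_nested_subsets(addition_order, subset_scores):
    """Pair each feature with the score of the smallest nested subset
    that contains it, then list the features in elimination order."""
    if len(addition_order) != len(subset_scores):
        raise ValueError("one score is needed per nested subset")
    colored = list(zip(addition_order, subset_scores))
    # Elimination order is the reverse of addition order, so the
    # singleton feature ends up at the right-most position.
    return list(reversed(colored))


features = ["f1", "f2", "f3", "f4", "f5"]
scores = [1, 2, 4, 5, 8]  # subset scores, used here as color indices 1-8
layout = color_nested_subsets(features, scores)
print(layout)
```

Each pair gives a feature label and the color of its box, read left to right.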

Nested subsets of features with cardinality increments greater than or
equal to one can be visualized as a list for which each feature belonging to a
larger subset appears after other features belonging to smaller subsets. (Note that
there is no unique solution; alphabetical order can be used to choose among
equivalent solutions.) The subsets can be identified by bars, optionally colored
according to the subset score. For example, assume eight features form nested
subsets: $\{f_1\} \subset \{f_1, f_2\} \subset \{f_1, \ldots\} \subset
\{f_1, f_2, f_3, f_4, f_5, f_6, f_7, f_8\}$ with scores (1,
2, 4, 7). Using an elimination order, the nested subsets can be represented as
shown in FIG. 12. The color map of FIG. 11a is applied to the bars. As in the
previous example, $f_1$ represents the singleton at the right-most position on the
diagram.

This type of visualization is easily generalized to nested subsets of subsets
of equivalent features. The feature labels are replaced by the labels of the subset
of equivalent features or that of the cluster center.

In a manner similar to that described for ranked lists, nested subsets of features
can also be represented as matrices of coefficients. The order of the columns that
represent all of the coefficients of a given feature follows the addition (or
elimination) order of the features in the nested subsets.

Trees can be visualized by various tree visualization software. FIG. 13
illustrates a screen shot of an interface generated by the "Gene Tree Explorer"
program implemented according to the present invention. The tree (which may
also be referred to as a "gene observation graph") that is represented is shown in
FIG. 14. While genes are represented in this example, other types of features can
be similarly represented. The nodes of the trees are marked with their feature
identifier, in this case, the Gene Accession Number (GAN) 1210, and shaded, or
preferably colored, according to the score of the feature subset, which was
obtained by walking from the root node to that node. The left frame 1202 of the
interface screen provides a link for exploring the tree by expanding the branches
as needed, showing that node and all the ancestor nodes. To access the expansion
for a given node, the user clicks one of the arrows 1206 next to the node, which
then displays the corresponding subset in the upper right frame 1204 of the
interface screen. In the upper right frame, clicking on a GAN 1210 results in
retrieving additional information about that gene which was obtained from
literature searches or other data sources. The retrieved information is then
displayed in the bottom right frame 1208 of the interface screen.

FIG. 14 illustrates the gene tree (observation graph) corresponding to the
screen information in FIG. 13. This tree was generated from DNA microarray
data of colon cancer and normal patients. Several runs using the RFE-SVM
algorithm were used to generate alternative nested subsets of genes. The nodes
are labeled with GANs. A path from the root node to a given node in the tree at
depth D defines a subset of D genes. In this case, D = 4. The quality of every
subset of genes can be assessed, for example, by the success rate of a classifier
trained with these genes. The shading (color) of the last node of a given path
indicates the quality of the subset. In the present example, a scale of 64 shades,
or colors, was used to map the leave-one-out success rate.

There are several possible ways to use the observation graph. For
example, consider a diagnostic test design based on the dosage of a maximum of
four proteins. The statistical analysis does not take into account which protein
may be easier to dose as compared to another protein. The preferred
"unconstrained" path in the tree is indicated by the bold edges (darker connecting
lines) in the tree, from the root node to the leaf node. This path corresponds to
running plain RFE-SVM. For example, imagine that by examining the first node
(H64807), it is found that this gene corresponds to a protein that is impractical to
dose. One can resort to the alternative protein (R55310) and then follow the
remaining unconstrained path indicated by the bold edges, or choose again the
following gene according to given constraints.

In this example, a binary tree of depth 4 is constructed. This means that for
every gene selection, only two alternatives are presented, and that up to four
genes can be selected. Wider trees (with more children at every node) permit
selection from a wider variety of genes. Deeper trees provide for selection of a
larger number of genes.
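
The constrained walk through the observation graph described above can be sketched in Python. This is a minimal illustration under stated assumptions: the nested-dictionary tree layout, the function names, and the convention that children are listed in order of preference are hypothetical, and the toy tree has depth 2 rather than 4. H64807 and R55310 are the accession numbers named in the text; `g2a` through `g2d` are placeholder labels for second-level genes.

```python
def choose_path(node, excluded, depth):
    """Walk down the tree picking, at each level, the first child whose
    gene is not excluded. Children are listed in order of preference."""
    path = []
    for _ in range(depth):
        allowed = [c for c in node["children"] if c["gene"] not in excluded]
        if not allowed:
            break  # no acceptable alternative at this level
        node = allowed[0]
        path.append(node["gene"])
    return path


def leaf(gene):
    return {"gene": gene, "children": []}


# Toy binary tree of depth 2; g2a..g2d are hypothetical placeholders.
tree = {"gene": "root", "children": [
    {"gene": "H64807", "children": [leaf("g2a"), leaf("g2b")]},
    {"gene": "R55310", "children": [leaf("g2c"), leaf("g2d")]},
]}

unconstrained = choose_path(tree, excluded=set(), depth=2)
constrained = choose_path(tree, excluded={"H64807"}, depth=2)
print(unconstrained)
print(constrained)
```

With no exclusions the walk follows the preferred branch at every level; excluding H64807 switches to the alternative R55310 and then continues along that node's preferred branch.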

In the exemplary application to bioinformatics, and referring to FIG. 7,
assume that the tree illustrated in FIG. 14 was generated from microarray data
input into the data analysis engine 504 of the first module 500 referred to as the
"Gene Discovery Lab". A second module 550, referred to as the "Gene
Knowledge Finder", is used to search and process input information
comprising data extracted from biomedical literature, generating a separate tree,
i.e., a "knowledge graph", identifying, e.g., a plurality of genes associated with
the disease of interest. The knowledge graph incorporates present human
knowledge about the genes, derived proteins, etc. A simple graph could be a set
of weights corresponding to how easily proteins can be measured in serum. The
two trees are integrated into a global combined graph, also referred to as a
"product graph", using data analysis engine 520. This global combined graph
provides multiple alternative candidate subsets of genes with a score attached to
them, providing a tool to attach a combined cost to every gene subset considered.
The cost combines the subset quality (from the statistical analysis) and how
promising the subset is from the knowledge graph information. The score reflects
how predictive the genes are from a statistical perspective and how interesting
they are from a biological perspective, providing valuable information for
purposes of drug design.
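
One way to picture the combined score is sketched below in Python. The text does not give the combination rule, so the convex combination, the weight `alpha`, the function name, and all numeric values are assumptions chosen only for illustration. The knowledge graph is reduced to its simplest form mentioned above, a set of per-gene weights; `g2a` and `g2c` are hypothetical gene labels.

```python
def combined_score(subset, quality, knowledge_weight, alpha=0.5):
    """Blend statistical quality with the mean knowledge weight of a subset.

    The convex combination and alpha are illustrative assumptions; the
    text does not give the actual combination rule.
    """
    mean_weight = sum(knowledge_weight[g] for g in subset) / len(subset)
    return alpha * quality + (1.0 - alpha) * mean_weight


# Hypothetical values: a success rate per candidate subset and per-gene
# weights for ease of measurement in serum (both in [0, 1]).
knowledge_weight = {"H64807": 0.2, "R55310": 0.9, "g2a": 0.8, "g2c": 0.7}
candidates = {
    ("H64807", "g2a"): 0.95,
    ("R55310", "g2c"): 0.90,
}

ranked = sorted(
    candidates,
    key=lambda s: combined_score(s, candidates[s], knowledge_weight),
    reverse=True,
)
print(ranked[0])
```

In this toy case the subset with the slightly lower statistical quality ranks first because its genes carry higher knowledge weights, which is the kind of trade-off the combined graph is meant to expose.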

This graph can be explored through a Web browser, with the knowledge
base being built interactively while exploring the graph. A subset of genes that is
optimum, at least in some sense, is returned to the user (customer) over the
Internet link. Alternatively, the customer can be provided with the software
browser and the gene graph for his/her own exploration. The graph may also be
made available for exploration on a Web site.

The graph construction is implemented using MatLab® (The MathWorks,
Inc., Natick, MA) using the following algorithm:

    for i = 1:node_num
        if (parent_of{i} == 1)  % root node
            forced_in_set = [];
        else
            forced_in_set = gene_id(ancestor_of{i});
            forced_in_set = forced_in_set(2:length(forced_in_set));  % eliminate root node
        end
        forced_out_set = gene_id([older_sibling_of{[ancestor_of{i} i]}]);
        if (~isempty(older_sibling_of{i}) | i == 2)  % otherwise, no need to do anything
            IGS = rfe('omar', Xtrain, Ytrain, IG, forced_in_set, forced_out_set, div_2, dbg_logfile);
            % Get indices of first descendents
            desc = [i first_descendents_of{i}];
            % Fill in corresponding gene numbers
            lg = length(desc);
            gn = length(IGS);
            gene_id(desc) = IGS(gn:-1:gn-lg+1);
        end
    end

The data mining platform of the present invention is capable of combining
analysis and pre-existing knowledge from a number of different data sources.
While the above-disclosed embodiment describes the platform in terms of having
two distinct modules, any number of modules may be provided for handling as
many types of data as are relevant to the desired analysis. Furthermore, although
the disclosed embodiment relates to analysis of gene expression data and
compares the results of that data analysis with information obtained from the
literature search and evaluation, other disciplines will also benefit from the ability
to combine heterogeneous data.

It should be understood, of course, that the foregoing relates only to
preferred embodiments of the present invention and that numerous modifications
or alterations may be made therein without departing from the spirit and the scope
of the invention as set forth in the appended claims. Such alternate embodiments
are considered to be encompassed within the spirit and scope of the present
invention. Accordingly, the scope of the present invention is described by the
appended claims and is supported by the foregoing description.

Claims

What is claimed is:

1. A computer-implemented data mining platform for generating an
output comprising knowledge from analysis of a plurality of heterogeneous data
types, the platform comprising:

a plurality of modules, each module adapted for processing one data type
of the plurality of heterogeneous data types, each module comprising an input
data source, a data analysis engine, a data output and a web server connection for
each of the input data source, the data analysis engine and the data output;

a web server connected to the web server connection for communicating
with each of the input data source, the data analysis engine and the data output
and for providing means for monitoring one or more of the input data source, the
data analysis engine, and the data output; and

a combined data analysis engine in communication with the web server
for combining the data output from the plurality of modules to generate a single
output representing results obtained from analyzing the plurality of
heterogeneous data types.

2. The data mining platform of claim 1, wherein the data analysis
engine comprises a processor for executing at least one support vector machine.

3. The data mining platform of claim 2, wherein the processor further
executes at least one pre-processing algorithm for pre-processing the input data.

4. The data mining platform of claim 3, wherein the pre-processing
algorithm comprises a feature selection algorithm.

5. The data mining platform of claim 4, wherein the feature selection
algorithm comprises recursive feature elimination.

6. The data mining platform of any one of claims 2 through 5,
wherein the data analysis engine includes a visualization tool for displaying the
results.



WO 02/103954    PCT/US02/19202



7. The data mining platform of any one of claims 1 through 6,
wherein one data type comprises gene expression data and wherein the data
analysis engine generates a ranked list of genes.

8. The data mining platform of claim 7, wherein the data analysis
engine generates a gene observation tree for visually displaying the ranked list of
genes.

9. The data mining platform of claim 7, wherein a second type of
data comprises knowledge contained in published literature and wherein the
combined data analysis engine validates the ranked list of genes according to the
knowledge contained in published literature.